Abstract

Word bigram models estimated from text corpora require smoothing methods to estimate the
probabilities of unseen bigrams. The deleted estimation method uses the formula:

$$\Pr(i \mid j) = \lambda f_i + (1 - \lambda) f_{i|j},$$

where $f_i$ and $f_{i|j}$ are the relative frequency of $i$ and the conditional relative frequency of $i$ given $j$,
respectively, and $\lambda$ is an optimized parameter. MacKay (1994) proposes a Bayesian approach using
Dirichlet priors, which yields a different formula:

$$\Pr(i \mid j) = \frac{\alpha}{F_j + \alpha}\, m_i + \left(1 - \frac{\alpha}{F_j + \alpha}\right) f_{i|j},$$

where $F_j$ is the count of $j$ and $\alpha$ and $m_i$ are optimized parameters. This thesis describes an
experiment in which the two methods were trained on a two-million-word corpus taken from the
Canadian Hansard and compared on the basis of the experimental perplexity that they assigned to
a shared test corpus. The methods proved to be about equally accurate, with MacKay's method
using fewer resources.

1 Introduction



Today, statistical techniques are finding wide application in natural language processing. They are 
used in speech processing and optical character recognition to distinguish between similar words, 
in parsing to disambiguate word senses and phrasal attachments, and in machine translation to 
choose natural-sounding translation equivalents, among other tasks. In all of these applications, the 
approach is to develop a statistical model from an existing body of data (the training corpus) and
then to use that model to interpret new, previously unseen, data. 

Since natural language is somewhat unruly, completely accurate statistical models for these tasks 
would be very complex. In practice, people adopt simplifying assumptions about the statistical 
models underlying natural language, so that the models can easily be described and worked with 
mathematically. 

One very simple type of model is the n-gram model, in which the probability of the next symbol
depends only on the $n-1$ symbols preceding it. Trigram models, where the probability of the $y$th
symbol depends on the $(y-1)$th and $(y-2)$th symbols, are frequently used, tractable in most cases,
and provide reasonable results for many tasks. Bigram models, in which the $y$th symbol depends only
on the $(y-1)$th symbol, while less powerful than trigram models, are also simpler mathematically,
and are useful for exploring new methods and theories. I use bigram models of word sequences in
text as the basis for the comparisons in this thesis.

Once the basic model is chosen, there may be a number of parameters to estimate. In n-gram 
models, the conditional probabilities of each possible symbol given the n-1 previous symbols must 
be found. A simple solution would be to use the conditional relative frequencies of n-grams in 
the training corpus as predictive probabilities. There is a problem with using relative frequencies, 
however: there might be "holes" in the training data. Suppose a certain n-gram never occurs in 
the training data. Then the conditional relative frequency of that n-gram is zero. A predictive 
probability of zero suggests that the n-gram cannot possibly occur, but it is not reasonable to 
conclude that an n-gram is impossible simply because it did not occur in the training data. 

One could try to avoid the problem by using enough training data to include every possible 
legitimate combination. But how much training data is enough? There might even be grammatical 
combinations of words that have never been uttered yet. 

What is needed is a way of assigning tiny positive probabilities to possible sequences of symbols 
that don't appear in the training data, while maintaining the predictive power and utility of the 
entire distribution. This is called smoothing the distribution. A number of smoothing methods have 
already been developed for natural language problems. When a new smoothing method is proposed, 
it is useful to know how it compares with other existing smoothing techniques. The point of this 
thesis is to compare two existing smoothing methods for word bigram models experimentally. 

Different smoothing techniques produce different probability distributions that function as rough 
approximations of the "true" model underlying language. But since the "true" model, if it exists,
is unknown, how can the "goodness" of a smoothing method be measured? In this thesis, I adopt
a utilitarian perspective. Remember that these models are being developed in order to identify the
most likely interpretation of some future data. Therefore, I compare the smoothing methods on the
basis of how well the resulting models identify the correct interpretation of some previously unseen 
data: that is, how probable the correct interpretation is under each model. In the word bigram 
models used in this thesis, the "interpretation" is simply the word order, and the best model is the 
one that assigns highest probability to the word orders actually occurring in some previously unseen 
sentences. A readable explanation of the theoretical basis for this approach to model comparison 
can be found in MacKay (1992). Of course, in determining the utility of a smoothing method for 
real applications, it is also necessary to consider the resources required to develop the model, and I 
will discuss this as well. 

2 Background 

2.1 Word Bigram Models 

Under a word bigram model, a sentence is generated by first randomly choosing an initial word, 
and then randomly choosing the next word with probabilities based on the previous word, until a 
complete sentence is constructed. A text is made up of many sentences so generated. By examining 
a generated text statistically, one attempts to estimate the conditional probability of each different 
word pair in the underlying bigram model that generated it. 

In this thesis, the training corpus (the generated text) is made up of $n$ sentences labelled
$s_1, \ldots, s_n$. Each sentence $s_x$ is made up of $\mathrm{len}(s_x)$ tokens labelled
$w_{x,1}, \ldots, w_{x,\mathrm{len}(s_x)}$. A token can be a word, a punctuation mark, or a special symbol.
The first token of each sentence is necessarily <s>, and the last token is necessarily </s>. The
probability of a single sentence in a bigram model is

$$\Pr(s_x) = \Pr(w_{x,1}) \prod_{y=2}^{\mathrm{len}(s_x)} \Pr(w_{x,y} \mid w_{x,y-1}). \qquad (1)$$

But since $\Pr(w_{x,1} = \texttt{<s>}) = 1$, this simplifies to:

$$\Pr(s_x) = \prod_{y=2}^{\mathrm{len}(s_x)} \Pr(w_{x,y} \mid w_{x,y-1}). \qquad (2)$$

The probability of the entire training corpus is then:

$$\Pr(s_1, \ldots, s_n) = \prod_{x=1}^{n} \prod_{y=2}^{\mathrm{len}(s_x)} \Pr(w_{x,y} \mid w_{x,y-1}). \qquad (3)$$

The goal is to find the predictive conditional probability matrix $Q$, where each element
$q_{i|j} = \Pr(w_{x,y} = i \mid w_{x,y-1} = j)$. $Q$ is a $W \times W$ matrix, where $W$ is the number of distinct words in
the vocabulary. There are an infinite number of possible assignments of values to the $q_{i|j}$, and the
goal is to get as close as possible to the "true" value of $Q$ by examining the training data.

Let $F_{i|j}$ be the number of times the bigram $ji$ occurs in the training data, and let $F_j$ be the
number of times token $j$ occurs.^1 The conditional relative frequency is

$$f_{i|j} = \frac{F_{i|j}}{F_j}. \qquad (4)$$

Setting $q_{i|j} = f_{i|j}$ produces the maximum likelihood estimator of $Q$. As I explained earlier, the
maximum likelihood estimator of $Q$ is inadequate because $f_{i|j} = 0$ for any bigram $ji$ that does not
occur in the training corpus, and a probability of zero is fatal in most applications. The following
smoothing methods address this problem.
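
The zero-probability problem is easy to reproduce on a toy corpus. The following is a minimal sketch, not the thesis's C implementation; the function names `bigram_counts` and `mle` and the two-sentence corpus are illustrative assumptions. It counts bigrams and contexts and evaluates equation 4 for one seen and one unseen bigram.

```python
from collections import Counter

def bigram_counts(sentences):
    """Return (pair counts F[(j, i)], context counts F[j]) for tokenized sentences."""
    pair_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts

def mle(pair_counts, context_counts, i, j):
    """Conditional relative frequency f_{i|j} = F_{i|j} / F_j (equation 4)."""
    if context_counts[j] == 0:
        return 0.0
    return pair_counts[(j, i)] / context_counts[j]

# Hypothetical toy corpus, with sentence-begin and sentence-end markers.
toy = [["<s>", "the", "cat", "sat", "</s>"],
       ["<s>", "the", "dog", "sat", "</s>"]]
pairs, contexts = bigram_counts(toy)
seen = mle(pairs, contexts, "cat", "the")    # "the cat": 1 of the 2 bigrams starting with "the"
unseen = mle(pairs, contexts, "dog", "cat")  # "cat dog" never occurs, so the estimate is zero
```

Any sentence containing "cat dog" would receive probability zero under this estimator, which is the failure that smoothing is meant to repair.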

2.2 Deleted Estimation 

Bahl, Jelinek, and Mercer (1983; Jelinek and Mercer 1980) describe a smoothing method for Markov
models, which is usually called deleted estimation.^2,3 Details of how it works for Markov models can
be found in their papers; I present a version for the special case of bigram models here.

Let $i$ and $j$ be two words that appear in the training corpus. With the deleted estimation
smoothing method, $q_{i|j}$ is a linear combination of the conditional relative frequency of $i$ given $j$
($f_{i|j}$) with the relative frequency of $i$ ($f_i$).

$$q_{i|j} = \lambda f_i + (1 - \lambda) f_{i|j} \qquad (5)$$

The proportions of the mixture are governed by a parameter $\lambda$. $\lambda$ is between 0 and 1, and it is
estimated from the training data to maximize the predictive power of the resulting distribution.
The idea is that, while there might not be sufficient occurrences of bigrams starting with $j$ for the
conditional relative frequency to be an accurate estimate of the probability of $i$ occurring after $j$, the
relative frequency of $i$ is a good estimator of the probability of $i$ occurring at all. The combination is
a neat formula that assigns plausible non-zero probabilities to all bigrams composed of words from
the training corpus.
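
As a small illustration of equation 5, the sketch below mixes a unigram distribution with a conditional distribution for one fixed context $j$. The function name `interpolate`, the three-word vocabulary, and the value 0.2 for the mixing weight are assumptions chosen for the example, not values from the experiment.

```python
def interpolate(lam, f_i, f_i_given_j):
    """Equation 5: mix the unigram and conditional relative frequencies."""
    return lam * f_i + (1.0 - lam) * f_i_given_j

# Hypothetical relative frequencies over a three-word vocabulary.
f = {"a": 0.5, "b": 0.3, "c": 0.2}            # f_i
f_given_j = {"a": 1.0, "b": 0.0, "c": 0.0}    # f_{i|j} for one fixed context j
q = {i: interpolate(0.2, f[i], f_given_j[i]) for i in f}
```

Because both inputs sum to one, the mixture does too, and the bigrams that were unseen after $j$ (here "b" and "c") receive small positive probabilities proportional to their unigram frequencies.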

Bahl, Jelinek, and Mercer (1983) actually use several different $\lambda$s depending on the relative
frequency of $j$. When there are many occurrences of bigrams starting with $j$, $f_{i|j}$ is more reliable, so
a smaller $\lambda$ is used to include more of $f_{i|j}$ in the mixture. Similarly, when there are few occurrences
of bigrams starting with $j$, a larger $\lambda$ is used so that the mixture depends more on $f_i$. These different
$\lambda$s are set using deleted interpolation, which will be described shortly.

^1 In general, $F_j = \sum_i F_{i|j}$. For $j = \texttt{</s>}$, however, $\sum_i F_{i|\texttt{</s>}} = 0$, because no bigram begins
with the sentence-end token.

^2 The method is also sometimes called deleted interpolation, after the method used to find $\lambda$. I will use the term
"deleted estimation" to refer to the whole process of obtaining a probability distribution using this method, and the
term "deleted interpolation" will be reserved for the sub-process of estimating $\lambda$.

^3 There is also a later method called "deleted estimation" by the same authors, which is referred to by Church and
Gale (1991).

To summarize the situation so far, we are looking for $Q = \{q_{i|j} \mid i, j = 1, \ldots, W\}$. $q_{i|j}$ depends
on the relative frequency $f_i$, the conditional relative frequency $f_{i|j}$, and $\lambda$. $f_i$ and $f_{i|j}$ can be
obtained by counting occurrences in the training corpus, and $\lambda$ is about to be determined by deleted
interpolation. This gives us all we need to predict the probability of any sentence that uses the
vocabulary of the training corpus.

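To prepare the reader for the discussion of deleted interpolation, let us start by supposing that
there is only one $\lambda$, and show how the value of $\lambda$ can be found using a standard maximum likelihood
estimation procedure. Substituting equation 5 into equation 3 yields the following expression for the
probability of the training corpus:

$$\Pr(s_1, \ldots, s_n) = \prod_{x=1}^{n} \prod_{y=2}^{\mathrm{len}(s_x)} \left( \lambda f_{w_{x,y}} + (1 - \lambda) f_{w_{x,y}|w_{x,y-1}} \right). \qquad (6)$$

To simplify the notation, let

$$u_{x,y} = f_{w_{x,y}} - f_{w_{x,y}|w_{x,y-1}} \qquad (7)$$

and

$$v_{x,y} = f_{w_{x,y}|w_{x,y-1}}. \qquad (8)$$

So equation 6 can be rewritten

$$\Pr(s_1, \ldots, s_n) = \prod_{x=1}^{n} \prod_{y=2}^{\mathrm{len}(s_x)} (\lambda u_{x,y} + v_{x,y}). \qquad (9)$$

The value of $\lambda$ that maximizes equation 9 will also maximize the log of equation 9. The log probability
is

$$\mathcal{L}(\lambda) = \sum_{x=1}^{n} \sum_{y=2}^{\mathrm{len}(s_x)} \log(\lambda u_{x,y} + v_{x,y}). \qquad (10)$$

To find the value of $\lambda$ that maximizes the log probability, we differentiate, set the derivative equal
to zero and solve for a root. The first and second derivatives of the log probability are

$$\mathcal{L}'(\lambda) = \sum_{x=1}^{n} \sum_{y=2}^{\mathrm{len}(s_x)} \frac{u_{x,y}}{\lambda u_{x,y} + v_{x,y}} \qquad (11)$$

$$\mathcal{L}''(\lambda) = -\sum_{x=1}^{n} \sum_{y=2}^{\mathrm{len}(s_x)} \frac{u_{x,y}^2}{(\lambda u_{x,y} + v_{x,y})^2} \qquad (12)$$
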
Notice that $\mathcal{L}''(\lambda) < 0$ always. Therefore any root of $\mathcal{L}'(\lambda) = 0$ will yield a global maximum for
$\mathcal{L}(\lambda)$, subject to checking the endpoints $\lambda = 0$ and $\lambda = 1$. So to find the value of $\lambda$ that maximizes
$\Pr(s_1, \ldots, s_n)$, we just find a root, $\hat{\lambda}$, of $\mathcal{L}'(\lambda) = 0$ using any standard numerical method, such as
binary search, and then compare $\Pr(s_1, \ldots, s_n \mid \lambda = \hat{\lambda})$, $\Pr(s_1, \ldots, s_n \mid \lambda = 0)$, and $\Pr(s_1, \ldots, s_n \mid \lambda = 1)$.
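
The root-finding step can be sketched as follows. This is an illustrative Python version under stated assumptions, not the original implementation: the per-token lists `u` and `v` are taken as given, every token is assumed to have a positive unigram frequency (so the mixture is positive strictly inside the interval), and the names `log_likelihood`, `d_log_likelihood`, and `best_lambda` are invented for the example.

```python
import math

def log_likelihood(lam, u, v):
    """Log probability as a function of lambda: sum of log(lambda * u + v)."""
    total = 0.0
    for u_xy, v_xy in zip(u, v):
        p = lam * u_xy + v_xy
        if p <= 0.0:
            return -math.inf  # a zero-probability token rules this lambda out
        total += math.log(p)
    return total

def d_log_likelihood(lam, u, v):
    """First derivative: sum of u / (lambda * u + v)."""
    return sum(u_xy / (lam * u_xy + v_xy) for u_xy, v_xy in zip(u, v))

def best_lambda(u, v, tol=1e-10):
    """Bisect on the decreasing derivative, then compare with both endpoints."""
    lo, hi = tol, 1.0 - tol
    if d_log_likelihood(lo, u, v) <= 0.0:
        root = lo            # derivative never positive: maximum at the left edge
    elif d_log_likelihood(hi, u, v) >= 0.0:
        root = hi            # derivative never negative: maximum at the right edge
    else:
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if d_log_likelihood(mid, u, v) > 0.0:
                lo = mid
            else:
                hi = mid
        root = 0.5 * (lo + hi)
    return max((0.0, root, 1.0), key=lambda lam: log_likelihood(lam, u, v))

# Two hypothetical tokens: (f_i, f_{i|j}) = (0.5, 0.0) and (0.1, 0.9).
u = [0.5 - 0.0, 0.1 - 0.9]
v = [0.0, 0.9]
lam_hat = best_lambda(u, v)
```

For these two tokens the derivative is $1/\lambda - 0.8/(0.9 - 0.8\lambda)$, whose root is $\lambda = 0.5625$; the endpoint $\lambda = 0$ is rejected because the first token would then have probability zero.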

However, Jelinek and Mercer (1980) point out that since $\lambda$ is intended to improve the ability
of the distribution to predict new data, it must be estimated on data different from that used to
calculate $f_i$ and $f_{i|j}$. Deleted interpolation is a method they devised to use all the available training
data for estimating both $\lambda$ and $f_i$ and $f_{i|j}$, without actually using the same data for both. They
divide the training data into a number, $b$, of blocks, and calculate $b$ different frequency tables, each
with a different block of data omitted from the counts. Then $\lambda$ is in effect estimated from the deleted
blocks.

To illustrate how deleted interpolation applies to the bigram model, let $f_i^k$ be the relative
frequency of the word $i$ calculated with block $k$ of data omitted, and let $f_{i|j}^k$ be the conditional relative
frequency of the word $i$ following word $j$, again calculated with block $k$ of the data omitted. Similarly,

$$u_{x,y}^k = f_{w_{x,y}}^k - f_{w_{x,y}|w_{x,y-1}}^k \qquad (13)$$

and

$$v_{x,y}^k = f_{w_{x,y}|w_{x,y-1}}^k. \qquad (14)$$

The probability of a corpus is

$$\Pr(s_1, \ldots, s_n) = \prod_{k=1}^{b} \; \prod_{s_x \in k\text{th block}} \; \prod_{y=2}^{\mathrm{len}(s_x)} (\lambda u_{x,y}^k + v_{x,y}^k). \qquad (15)$$

The log probability is

$$\mathcal{L}(\lambda) = \sum_{k=1}^{b} \; \sum_{s_x \in k\text{th block}} \; \sum_{y=2}^{\mathrm{len}(s_x)} \log(\lambda u_{x,y}^k + v_{x,y}^k). \qquad (16)$$

Equation 11 is replaced by

$$\mathcal{L}'(\lambda) = \sum_{k=1}^{b} \; \sum_{s_x \in k\text{th block}} \; \sum_{y=2}^{\mathrm{len}(s_x)} \frac{u_{x,y}^k}{\lambda u_{x,y}^k + v_{x,y}^k} \qquad (17)$$

and we can solve for a root of $\mathcal{L}'(\lambda) = 0$ exactly as before.
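
The bookkeeping for the leave-one-block-out frequencies can be sketched as below. This is a hedged illustration: the function name `held_out_pairs`, the choice to compute the unigram relative frequency over all tokens of the remaining blocks, and the toy two-block corpus are assumptions made for the example. Each block's own counts are subtracted from the totals, which gives the tables "with block $k$ omitted" without recounting.

```python
from collections import Counter

def held_out_pairs(blocks):
    """For every bigram in block k, return (u, v) computed from counts that omit block k."""
    per_block = []
    for block in blocks:
        uni, bi, ctx = Counter(), Counter(), Counter()
        for sent in block:
            uni.update(sent)
            for prev, cur in zip(sent, sent[1:]):
                bi[(prev, cur)] += 1
                ctx[prev] += 1
        per_block.append((uni, bi, ctx))

    total_uni, total_bi, total_ctx = Counter(), Counter(), Counter()
    for uni, bi, ctx in per_block:
        total_uni.update(uni)
        total_bi.update(bi)
        total_ctx.update(ctx)

    pairs = []
    for block, (uni, bi, ctx) in zip(blocks, per_block):
        n_tokens = sum(total_uni.values()) - sum(uni.values())
        for sent in block:
            for prev, cur in zip(sent, sent[1:]):
                f_i = (total_uni[cur] - uni[cur]) / n_tokens
                ctx_count = total_ctx[prev] - ctx[prev]
                f_ij = (total_bi[(prev, cur)] - bi[(prev, cur)]) / ctx_count if ctx_count else 0.0
                pairs.append((f_i - f_ij, f_ij))  # (u, v) with block k omitted
    return pairs

# Hypothetical corpus split into two one-sentence blocks.
toy_blocks = [[["<s>", "a", "b", "</s>"]],
              [["<s>", "a", "c", "</s>"]]]
toy_pairs = held_out_pairs(toy_blocks)
```

The resulting pairs can be fed to any one-dimensional root finder for the derivative of the log probability. Note that a token absent from all the other blocks yields the pair (0, 0), whose mixture is zero for every value of the weight; how such tokens should be treated is not specified here, and skipping them is one possible assumption.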

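With this understanding of how deleted interpolation works for one $\lambda$, let us consider the
adjustments needed for the case where several $\lambda$s are used. Suppose we divide the range of possible relative
frequencies (the interval $[0, 1]$) into $r$ subintervals, with a separate $\lambda_h$ for each interval $h = 1, \ldots, r$.^4
Whenever we calculate the conditional probability for a word $i$ using

$$q_{i|j} = \lambda_h f_i + (1 - \lambda_h) f_{i|j} \qquad (18)$$

we use the $\lambda_h$ for the interval containing $f_j$. When this change is incorporated, the probability of
the training data (with $b$ the number of blocks) becomes

$$\Pr(s_1, \ldots, s_n) = \prod_{k=1}^{b} \; \prod_{s_x \in k\text{th block}} \; \prod_{h=1}^{r} \; \prod_{\substack{2 \le y \le \mathrm{len}(s_x) \text{ and} \\ f^k_{w_{x,y-1}} \in h\text{th interval}}} (\lambda_h u_{x,y}^k + v_{x,y}^k). \qquad (19)$$

The log probability is

$$\mathcal{L}(\lambda_1, \ldots, \lambda_r) = \sum_{k=1}^{b} \; \sum_{s_x \in k\text{th block}} \; \sum_{h=1}^{r} \; \sum_{\substack{2 \le y \le \mathrm{len}(s_x) \text{ and} \\ f^k_{w_{x,y-1}} \in h\text{th interval}}} \log(\lambda_h u_{x,y}^k + v_{x,y}^k). \qquad (20)$$

The first and second derivatives of the log probability with respect to each $\lambda_h$ are:

$$\frac{\partial \mathcal{L}}{\partial \lambda_h}(\lambda_1, \ldots, \lambda_r) = \sum_{k=1}^{b} \; \sum_{s_x \in k\text{th block}} \; \sum_{\substack{2 \le y \le \mathrm{len}(s_x) \text{ and} \\ f^k_{w_{x,y-1}} \in h\text{th interval}}} \frac{u_{x,y}^k}{\lambda_h u_{x,y}^k + v_{x,y}^k} \qquad (21)$$

$$\frac{\partial^2 \mathcal{L}}{\partial \lambda_h^2}(\lambda_1, \ldots, \lambda_r) = -\sum_{k=1}^{b} \; \sum_{s_x \in k\text{th block}} \; \sum_{\substack{2 \le y \le \mathrm{len}(s_x) \text{ and} \\ f^k_{w_{x,y-1}} \in h\text{th interval}}} \frac{(u_{x,y}^k)^2}{(\lambda_h u_{x,y}^k + v_{x,y}^k)^2} \qquad (22)$$

Since $\frac{\partial^2 \mathcal{L}}{\partial \lambda_h^2}(\lambda_1, \ldots, \lambda_r) < 0$ for all $h$, it is safe to solve for each $\lambda_h$ separately, and then combine
the results. This can be done using binary search as before. Deleted estimation requires $O(N)$
computations per iteration, where $N$ is the number of bigrams in the training corpus.

^4 I do this by dividing the expected range of relative frequencies into $r$ equal subintervals, and then including any
higher-than-expected relative frequencies with the $r$th subinterval.

2.3 MacKay's Bayesian Smoothing Method

MacKay (1994) notes that "any rational predictive procedure can be made Bayesian", and proposes
that by making some assumptions explicit and using Bayesian methods, a better predictive
distribution of a similar form can be obtained without any need for cross-validation. Here is how MacKay's
method works.
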
As before, we are estimating the $W \times W$ probability matrix $Q$, where each element $q_{i|j}$ is the
probability that $w_{x,y} = i$ given that $w_{x,y-1} = j$. The Bayesian way to do this is to choose some
reasonable prior distribution for $Q$, without looking at the data, and then use Bayes's rule to get a
posterior distribution that takes the data into account. MacKay uses Dirichlet priors, mainly because
they have the useful property that the posterior has the same form as the prior, thus simplifying the
mathematics. Dirichlet priors make use of a null measure, which is the vector of relative frequencies
that one might expect given no data, and a control parameter, which measures how much spread
around the null measure one expects to see in the data. MacKay hypothesizes a model for $Q$ that
uses a set of Dirichlet priors, one for each row $\mathbf{q}_{|j}$ of the matrix, all sharing a common null measure,
$\mathbf{m}$, and control parameter, $\alpha$, but otherwise independent. He adds an extra degree of freedom to
the model for $Q$ by considering $\mathbf{m}$ and $\alpha$ as unknowns to be estimated from the data. He calls this
hypothesized model $\mathcal{H}_M$.

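Suppose we already know $\mathbf{m}$ and $\alpha$, and let $D = s_1, \ldots, s_n$ represent the training corpus. Then
by Bayes's rule,

$$\Pr(Q \mid D, \mathbf{m}, \alpha, \mathcal{H}_M) = \frac{\Pr(D \mid Q, \mathcal{H}_M) \Pr(Q \mid \mathbf{m}, \alpha, \mathcal{H}_M)}{\Pr(D \mid \mathbf{m}, \alpha, \mathcal{H}_M)}. \qquad (23)$$

Since by $\mathcal{H}_M$ the rows of $Q$ are independent except for $\mathbf{m}$ and $\alpha$:

$$\Pr(Q \mid \mathbf{m}, \alpha, \mathcal{H}_M) = \prod_j \Pr(\mathbf{q}_{|j} \mid \mathbf{m}, \alpha, \mathcal{H}_M). \qquad (24)$$

The Dirichlet priors for the rows of $Q$ have the form

$$\Pr(\mathbf{q}_{|j} \mid \mathbf{m}, \alpha, \mathcal{H}_M) = \frac{1}{Z} \prod_i q_{i|j}^{\alpha m_i - 1} \, \delta\!\left(\sum_i q_{i|j} - 1\right) \qquad (25)$$

where

$$Z = \frac{\prod_i \Gamma(\alpha m_i)}{\Gamma(\alpha)}. \qquad (26)$$

The likelihood is simply

$$\Pr(D \mid Q, \mathcal{H}_M) = \prod_{x=1}^{n} \prod_{y=2}^{\mathrm{len}(s_x)} q_{w_{x,y}|w_{x,y-1}} = \prod_j \prod_i q_{i|j}^{F_{i|j}}. \qquad (27)$$

The evidence, $\Pr(D \mid \mathbf{m}, \alpha, \mathcal{H}_M)$, does not depend on $Q$, so we can treat it as a constant for now.
It is therefore possible to derive the posterior distribution for $Q$ as follows:

$$\Pr(Q \mid D, \mathbf{m}, \alpha, \mathcal{H}_M) = \frac{\Pr(D \mid Q, \mathcal{H}_M) \prod_j \Pr(\mathbf{q}_{|j} \mid \mathbf{m}, \alpha, \mathcal{H}_M)}{\Pr(D \mid \mathbf{m}, \alpha, \mathcal{H}_M)} \qquad (28)$$

$$= \frac{\left(\prod_j \prod_i q_{i|j}^{F_{i|j}}\right) \left(\prod_j \frac{1}{Z} \prod_i q_{i|j}^{\alpha m_i - 1} \, \delta\!\left(\sum_i q_{i|j} - 1\right)\right)}{\Pr(D \mid \mathbf{m}, \alpha, \mathcal{H}_M)} \qquad (29)$$

$$= \frac{\prod_j \frac{1}{Z} \prod_i q_{i|j}^{F_{i|j} + \alpha m_i - 1} \, \delta\!\left(\sum_i q_{i|j} - 1\right)}{\Pr(D \mid \mathbf{m}, \alpha, \mathcal{H}_M)}. \qquad (30)$$

Remember that the posterior distributions for the rows of $Q$ are giving us the probabilities of
the different values of $q_{i|j}$, considering the data. The predictive probability of $i$ given $j$ is:

$$\Pr(i \mid j, D, \mathbf{m}, \alpha, \mathcal{H}_M) = \int q_{i|j} \Pr(\mathbf{q}_{|j} \mid D, \mathbf{m}, \alpha, \mathcal{H}_M) \, d^W \mathbf{q}_{|j}. \qquad (31)$$

This is just the mean of a Dirichlet distribution, which is known to be

$$\Pr(i \mid j, D, \mathbf{m}, \alpha, \mathcal{H}_M) = \frac{F_{i|j} + \alpha m_i}{\sum_{i'} (F_{i'|j} + \alpha m_{i'})} = \frac{F_{i|j} + \alpha m_i}{F_j + \alpha} \qquad (32)$$

where $F_{i|j}$ and $F_j$ are the counts of bigram $ji$ and token $j$ in the training data as before.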
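
A direct transcription of equation 32 is short. The sketch below is illustrative only: the function name `dirichlet_predictive`, the three-word vocabulary, the counts, and the parameter values (a control parameter of 2 and a hand-picked null measure) are assumptions, not fitted values.

```python
def dirichlet_predictive(F_ij, F_j, alpha, m_i):
    """Equation 32: posterior mean of q_{i|j} under a Dirichlet prior."""
    return (F_ij + alpha * m_i) / (F_j + alpha)

# Hypothetical counts of words following one context j, and a null measure m.
counts_after_j = {"a": 3, "b": 7, "c": 0}
m = {"a": 0.5, "b": 0.25, "c": 0.25}
alpha = 2.0
F_j = sum(counts_after_j.values())
predictive = {i: dirichlet_predictive(counts_after_j[i], F_j, alpha, m[i])
              for i in counts_after_j}
```

The unseen word "c" receives probability 0.5/12 rather than zero, and the probabilities still sum to one because the null measure does.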

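So far we have a formula for $q_{i|j}$ given that we know $\mathbf{m}$ and $\alpha$, but $\mathbf{m}$ and $\alpha$ still need to be
estimated from the data. This is also done by Bayesian inference. By Bayes's rule,

$$\Pr(\mathbf{m}, \alpha \mid D, \mathcal{H}_M) = \frac{\Pr(D \mid \mathbf{m}, \alpha, \mathcal{H}_M) \Pr(\mathbf{m}, \alpha \mid \mathcal{H}_M)}{\Pr(D \mid \mathcal{H}_M)}. \qquad (33)$$

The priors for $\alpha$ and $\mathbf{m}$ are uninformative (i.e. $\Pr(\log \alpha)$ and $\Pr(\mathbf{m})$ are constant as $\alpha$ and $\mathbf{m}$,
respectively, are varied), and independent, and $\Pr(D \mid \mathcal{H}_M)$ is independent of $\mathbf{m}$ and $\alpha$, so we are
left with

$$\Pr(\mathbf{m}, \alpha \mid D, \mathcal{H}_M) \propto \Pr(D \mid \mathbf{m}, \alpha, \mathcal{H}_M). \qquad (34)$$

At this point we need an expression for $\Pr(D \mid \mathbf{m}, \alpha, \mathcal{H}_M)$, which previously appeared as a
normalizing constant. We know that equation 30 is a product of normalized Dirichlet distributions,
each of which has a normalizing constant $1/Z'_j$, where

$$Z'_j = \frac{\prod_i \Gamma(F_{i|j} + \alpha m_i)}{\Gamma\!\left(\sum_i (F_{i|j} + \alpha m_i)\right)}. \qquad (35)$$

Matching the constants in equation 30 gives

$$\prod_j \frac{1}{Z'_j} = \frac{\prod_j \frac{1}{Z}}{\Pr(D \mid \mathbf{m}, \alpha, \mathcal{H}_M)} \qquad (36)$$

so that

$$\Pr(D \mid \mathbf{m}, \alpha, \mathcal{H}_M) = \prod_j \frac{Z'_j}{Z} \qquad (37)$$

$$= \prod_j \left( \frac{\prod_i \Gamma(F_{i|j} + \alpha m_i)}{\Gamma\!\left(\sum_i (F_{i|j} + \alpha m_i)\right)} \cdot \frac{\Gamma(\alpha)}{\prod_i \Gamma(\alpha m_i)} \right). \qquad (38)$$
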
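Since $\sum_i m_i = 1$, we can define $u_i = \alpha m_i$ and $\alpha = \sum_i u_i$ to get an equation that depends only
on $\mathbf{u}$:

$$\Pr(D \mid \mathbf{u}, \mathcal{H}_M) = \prod_j \left( \frac{\prod_i \Gamma(F_{i|j} + u_i)}{\Gamma(F_j + \alpha)} \cdot \frac{\Gamma(\alpha)}{\prod_i \Gamma(u_i)} \right). \qquad (39)$$

We can recover $\mathbf{m}$ and $\alpha$ from $\mathbf{u}$ at any time by using the above identity.

Strictly, we should integrate out $\mathbf{u}$ when calculating the predictive probabilities, as we did in
equation 31. However, under certain conditions that are expected to hold in this case, it is possible
to approximate using $\mathbf{u}^{MP}$, the most probable value of $\mathbf{u}$. That is

$$\Pr(i \mid j, D, \mathcal{H}_M) = \int \Pr(\mathbf{u} \mid D, \mathcal{H}_M) \Pr(i \mid j, D, \mathbf{u}, \mathcal{H}_M) \, d^W \mathbf{u} \qquad (40)$$

$$\simeq \Pr(i \mid j, D, \mathbf{u}^{MP}, \mathcal{H}_M) \qquad (41)$$

$$= \frac{F_{i|j} + u_i^{MP}}{F_j + \sum_i u_i^{MP}} \qquad (42)$$

by equation 32.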

To find $\mathbf{u}^{MP}$, we take the log of equation 39, differentiate with respect to $u_i$, set the resulting
expression equal to zero, and solve for $u_i$. The digamma function is defined by

$$\Psi(x) = \frac{d}{dx} \log \Gamma(x) \qquad (43)$$

and obeys the recurrence relation

$$\Psi(x + 1) = \Psi(x) + \frac{1}{x}. \qquad (44)$$

For $x > 0.1$ it can be approximated by

$$\Psi(x) \simeq \log(x) - \frac{1}{2x} + O\!\left(\frac{1}{x^2}\right). \qquad (45)$$



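Since

$$\log \Pr(D \mid \mathbf{u}, \mathcal{H}_M) = \sum_j \left( \sum_i \left( \log \Gamma(F_{i|j} + u_i) - \log \Gamma(u_i) \right) + \log \Gamma(\alpha) - \log \Gamma(F_j + \alpha) \right), \qquad (46)$$

then

$$\frac{\partial}{\partial u_i} \log \Pr(D \mid \mathbf{u}, \mathcal{H}_M) = \sum_j \left( \Psi(F_{i|j} + u_i) - \Psi(u_i) + \Psi(\alpha) - \Psi(F_j + \alpha) \right) \qquad (47)$$
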

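using the definition of $\Psi(x)$ above and the chain rule. For word bigram models trained on large
corpora, it is expected that $u_i < 0.5$ and $\alpha > 1$.^5 Application of recurrence relation 44 yields

$$\Psi(F_{i|j} + u_i) - \Psi(u_i) = \frac{1}{F_{i|j} - 1 + u_i} + \frac{1}{F_{i|j} - 2 + u_i} + \cdots + \frac{1}{2 + u_i} + \frac{1}{1 + u_i} + \frac{1}{u_i} \qquad (48)$$

which, for $u_i < 0.5$, can be approximated by

$$\Psi(F_{i|j} + u_i) - \Psi(u_i) \simeq \begin{cases} \dfrac{1}{u_i} + \displaystyle\sum_{f=2}^{F_{i|j}} \frac{1}{f - 1} - u_i \sum_{f=2}^{F_{i|j}} \frac{1}{(f - 1)^2} & \text{for } F_{i|j} \ge 1 \\ 0 & \text{for } F_{i|j} = 0. \end{cases} \qquad (49)$$

Let $N_{fi}$ be the number of $j$s for which $F_{i|j} \ge f$, and define $G_i = \sum_{f \ge 2} \frac{N_{fi}}{f - 1}$ and $H_i = \sum_{f \ge 2} \frac{N_{fi}}{(f - 1)^2}$.
Then

$$\sum_j \left( \Psi(u_i + F_{i|j}) - \Psi(u_i) \right) \simeq \frac{N_{1i}}{u_i} + G_i - u_i H_i. \qquad (50)$$
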

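Similarly, using approximation 45,

$$\Psi(F_j + \alpha) - \Psi(\alpha) \simeq \log\!\left(\frac{F_j + \alpha}{\alpha}\right) + \frac{1}{2} \left(\frac{F_j}{\alpha (F_j + \alpha)}\right). \qquad (51)$$

Define

$$K(\alpha) = \sum_j \left( \log\!\left(\frac{F_j + \alpha}{\alpha}\right) + \frac{1}{2} \left(\frac{F_j}{\alpha (F_j + \alpha)}\right) \right) \qquad (52)$$

so that

$$\sum_j \left( \Psi(F_j + \alpha) - \Psi(\alpha) \right) \simeq K(\alpha). \qquad (53)$$

Substituting approximations 50 and 53 into equation 47 yields

$$\frac{\partial}{\partial u_i} \log \Pr(D \mid \mathbf{u}, \mathcal{H}_M) \simeq \frac{N_{1i}}{u_i} + G_i - u_i H_i - K(\alpha). \qquad (54)$$

We set the right hand side of equation 54 equal to zero and solve for $u_i$:

$$u_i = \frac{2 N_{1i}}{K(\alpha) - G_i + \sqrt{(K(\alpha) - G_i)^2 + 4 H_i N_{1i}}}. \qquad (55)$$

Since $G_i$, $H_i$ and $N_{1i}$ do not depend on $\mathbf{u}$ and can be pre-computed, all that remains is to find $\alpha$ such
that $\alpha = \sum_i u_i$ and the $u_i$s satisfy equation 55. This can be done by choosing an initial value for
$\alpha$, using equation 55 to calculate the $u_i$s, and then iteratively adjusting $\alpha$ until $\alpha = \sum_i u_i$. This
method requires $O(W)$ computations per iteration for training.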
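
The iteration just described might be sketched as follows. This is a hedged illustration, not MacKay's or the thesis's code: the names `solve_u_i`, `k_of_alpha`, and `fit_u` are invented, the input is assumed to be a dictionary from (context, word) pairs to positive counts, and the rule for adjusting the control parameter (replace it by the sum of the current estimates and repeat) is only one simple choice among several. The approximations assume large corpora, so behaviour on tiny inputs is not representative.

```python
import math
from collections import defaultdict

def solve_u_i(n1, g, h, k):
    """Positive root of n1/u + g - u*h - k = 0, written in the numerically stable form."""
    return 2.0 * n1 / (k - g + math.sqrt((k - g) ** 2 + 4.0 * h * n1))

def k_of_alpha(alpha, context_counts):
    """K(alpha): sum over contexts j of log((F_j + a)/a) + F_j / (2 a (F_j + a))."""
    return sum(math.log((f_j + alpha) / alpha) + 0.5 * f_j / (alpha * (f_j + alpha))
               for f_j in context_counts.values())

def fit_u(bigram_counts, alpha=1.0, max_iter=500, tol=1e-10):
    """bigram_counts maps (j, i) to the positive count of bigram ji."""
    context_counts = defaultdict(int)
    n1, g, h = defaultdict(int), defaultdict(float), defaultdict(float)
    for (j, i), count in bigram_counts.items():
        context_counts[j] += count
        n1[i] += 1                           # one more context in which i occurs
        for f in range(2, count + 1):        # accumulates N_fi/(f-1) and N_fi/(f-1)^2
            g[i] += 1.0 / (f - 1)
            h[i] += 1.0 / (f - 1) ** 2
    u = {}
    for _ in range(max_iter):
        k = k_of_alpha(alpha, context_counts)
        u = {i: solve_u_i(n1[i], g[i], h[i], k) for i in n1}
        new_alpha = sum(u.values())
        if abs(new_alpha - alpha) < tol:
            break
        alpha = new_alpha                    # assumed update rule: alpha <- sum of u_i
    return u, alpha

u_check = solve_u_i(3, 1.5, 1.25, 10.0)
u_fit, alpha_fit = fit_u({("the", "cat"): 2, ("the", "dog"): 1, ("a", "cat"): 1})
```

Each pass touches every vocabulary word once after the statistics have been precomputed, which is where the per-iteration cost linear in the vocabulary size comes from.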

^5 Trials with an earlier version of the algorithm presented found an optimal value for $\alpha$ of about 14. So, for a corpus
with a 20,000 word vocabulary, the average value of $u_i$ would be expected to be around $\frac{14}{20{,}000}$.

MacKay expects his Bayesian method to yield superior results over the deleted estimation method
for two reasons. Equation 32 can be restated as

$$q_{i|j} = \left(\frac{\alpha}{F_j + \alpha}\right) m_i + \left(1 - \frac{\alpha}{F_j + \alpha}\right) f_{i|j}. \qquad (56)$$

Here, the expression $\frac{\alpha}{F_j + \alpha}$ plays the same role as $\lambda$ in the deleted estimation method, but there is an
important difference. While in the deleted estimation method there are one or several $\lambda$s, each set by
cross-validation from the training data, here there is effectively a separate $\lambda_j = \frac{\alpha}{F_j + \alpha}$ for each word.
These are generated automatically from the frequency of $j$ in the training data, and embody the
intuition that the more frequent a word is, the more reliable are the bigram statistics conditioned on
that word. So the first advantage is that the individual $\lambda_j$s allow for greater sensitivity in the mixture.
The second advantage is that, when a bigram $ji$ does not occur in the training data, the fallback
term in $q_{i|j}$ is $m_i$, a parameter optimized especially for this role, rather than the relative frequency
of $i$. This difference would be important in cases where $i$ occurred very frequently following a certain
frequent word $k$, but rarely elsewhere. In these cases, the relative frequency would produce a high
probability for $ji$, whereas $m_i$ would average the high probability of $ki$ with the low probabilities of
other bigrams ending in $i$ and produce a more moderate probability for $ji$.
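
The equivalence between the count form and the mixture form, and the way the implied per-word weight shrinks as the context becomes more frequent, can be checked numerically. The function names and the parameter values below (a control parameter of 2, a null-measure entry of 0.25, context counts of 10 and 10,000) are illustrative assumptions only.

```python
def lambda_j(F_j, alpha):
    """Per-context mixing weight alpha / (F_j + alpha)."""
    return alpha / (F_j + alpha)

def mixture_form(F_ij, F_j, alpha, m_i):
    """Mixture of the null measure m_i and the conditional relative frequency F_ij / F_j."""
    lam = lambda_j(F_j, alpha)
    return lam * m_i + (1.0 - lam) * (F_ij / F_j)

def count_form(F_ij, F_j, alpha, m_i):
    """Smoothed-count form: (F_ij + alpha * m_i) / (F_j + alpha)."""
    return (F_ij + alpha * m_i) / (F_j + alpha)

rare_context = lambda_j(10, 2.0)        # few observations: more weight on m_i
common_context = lambda_j(10000, 2.0)   # many observations: the data dominate
```

With these values the weight on the null measure falls from about 0.17 for the rare context to about 0.0002 for the common one, with no cross-validation involved.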

MacKay suggests an experimental comparison of his method to deleted estimation, but does not 
carry it out. 

2.4 Perplexity 

The predictive accuracy of competing models is often compared by evaluating the perplexity under 
each model of some common test data. 

Perplexity is defined as $2^{H(P,Q)}$, where $H(P, Q)$ is the cross-entropy of the unknown "true" model
$P$ as measured by the estimated model $Q$. Brown et al. (1992) show that under some reasonable
assumptions,

$$H(P, Q) = \lim_{N \to \infty} -\frac{1}{N} \log_2 \Pr\nolimits_Q(X_1, \ldots, X_N), \qquad (57)$$

where $X_1, \ldots, X_N$ is a previously unseen sample of data of size $N$. The better the model $Q$, the
lower the cross-entropy $H(P, Q)$, and the smaller the perplexity.
Assuming that, for a large enough test sample,

$$\lim_{N \to \infty} -\frac{1}{N} \log_2 \Pr\nolimits_Q(X_1, \ldots, X_N) \approx -\frac{1}{N} \log_2 \Pr\nolimits_Q(X_1, \ldots, X_N), \qquad (58)$$

we have

$$H(P, Q) \approx -\frac{1}{N} \log_2 \Pr\nolimits_Q(X_1, \ldots, X_N). \qquad (59)$$

Some simple manipulation^6 produces

$$\text{Perplexity} = \Pr\nolimits_Q(X_1, \ldots, X_N)^{-\frac{1}{N}}. \qquad (60)\text{^7}$$

But under the bigram models I am using,

$$\Pr\nolimits_Q(X_1, \ldots, X_N) = \prod_{t=2}^{N} q_{X_t|X_{t-1}} \qquad (62)$$

so

$$\text{Perplexity} = \left[ \prod_{t=2}^{N} q_{X_t|X_{t-1}} \right]^{-\frac{1}{N}}. \qquad (63)$$




So then, if the test sample is large enough, there is a simple method for comparing the accuracy of 
the competing models: calculate the perplexity of the test sample under each model, and the model 
with the lowest perplexity wins. 
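
In practice the product of many small probabilities underflows, so perplexity is usually accumulated in log space. The sketch below is one way to do that and is not claimed to be the thesis's formula for the implementation; the function name `perplexity` is an assumption, and $N$ is taken here to be the number of conditional probabilities supplied.

```python
import math

def perplexity(probabilities):
    """2 ** (-(1/N) * sum of log2 q), with N the number of probabilities supplied."""
    n = len(probabilities)
    log2_sum = sum(math.log2(q) for q in probabilities)
    return 2.0 ** (-log2_sum / n)

# A model that assigns 1/4 to every test bigram is as "perplexed" as a four-way choice.
uniform_quarter = perplexity([0.25] * 8)
mixed = perplexity([0.5, 0.125])
```

Both examples evaluate to 4: the first because every probability is 1/4, the second because the geometric mean of 1/2 and 1/8 is also 1/4.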

The next section describes an experiment I designed to compare the accuracy of models obtained 
by using deleted estimation and MacKay's Bayesian methods. 

3 The Experiment 

The text corpus on which the experiment was based was taken from the English portion of Gale and 
Church's (1991) sentence-aligned version of the Canadian Hansard, the proceedings of the Canadian 
Parliament. This text had already been separated into sentences and stripped of titles, formatting 
codes, speaker identifiers, and so on (see sample in Appendix B). There are approximately 30 million 
words of data available in this format: I used about 3 million words of it for this experiment. 

Some further pre-processing was required to prepare the data for use by my programs. Sentence 
numbers were stripped, sentence-begin (<s>) and sentence-end (</s>) markers were added, and 
each sentence was placed on a single line. In keeping with the common practice for experiments of 
this type, punctuation and suffixes beginning with apostrophes were split off from the words they 
followed, becoming separate tokens. In order to reduce the total number of types in the vocabulary, 
I also replaced each number by the special token . 

The resulting sentences were divided into nine blocks of about 1.7 megabytes each, with
consecutive sentences going to different blocks. For example, sentences 1, 10, and 19 went to block 1,
sentences 2, 11, and 20 to block 2, sentences 3, 12, and 21 to block 3, and so on. This interleaving of
the sentences was an important step, because different sections of the text deal with different topics,
resulting in different token frequencies. Church and Gale (1991) found that significant differences
in language use can occur between different sections of the same corpus, because at different times
in the collection period, different topics are being discussed by different people. A model trained on
the data from one time period may not fit the data from another time period. By mixing up the
sentences, I distributed the data from each time period across all the blocks. The first six blocks
were used for training data (about 2 million words), and the test data were extracted from the
remaining three blocks. A sample from one of these blocks is shown in Appendix B.

^6 Shown to me by Peter Brown.

^7 This neat formula is for conceptual purposes. In the implementation, a different manipulation yields equation (61),
which is easier to implement in C.
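
The interleaving described above amounts to a round-robin assignment. A minimal sketch, with the function name `interleave` and the use of plain sentence numbers as stand-ins for sentences being assumptions for illustration:

```python
def interleave(sentences, n_blocks=9):
    """Send consecutive sentences to different blocks, round-robin."""
    blocks = [[] for _ in range(n_blocks)]
    for index, sentence in enumerate(sentences):
        blocks[index % n_blocks].append(sentence)
    return blocks

# Sentence numbers 1..27 stand in for the sentences themselves.
blocks = interleave(list(range(1, 28)))
training_blocks, test_blocks = blocks[:6], blocks[6:]
```

With nine blocks, sentences 1, 10, and 19 land in the first block, matching the example in the text, and every block draws from the whole collection period.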

Three versions of the test sample were prepared. First, the smoothing methods being compared 
only assign probabilities to bigrams composed of tokens that appear in the training data, so they 
have no way of dealing with previously unseen tokens. Therefore, I removed all sentences that 
contained a token that did not occur in the training data. This left 14,393 sentences (about 260,000 
tokens) in Sample 1. Next, recognizing that the Hansard contains many conventional phrases and 
sentences that might skew the results of the experiment, I removed from Sample 1 all sentences that 
were duplicated in either the test data or the training data. This left 12,000 sentences (about 243,000 
tokens) in Sample 2. Finally, to test whether the sample was large enough for the approximation
of perplexity in equation 59 to hold, I pseudo-randomly^8 chose half the sentences in Sample 2 to
become Sample 3 (6000 sentences, about 116,000 tokens).

The two smoothing methods have different numbers of parameters to be optimized. For the
deleted estimation method, the number of $\lambda$s is set by the experimenter. In MacKay's Bayesian
method, there is one $u_i$ for each token in the training data vocabulary. In order to investigate
the possible objection that MacKay's Bayesian method would fit the data better simply because
of its many parameters, I ran the deleted estimation method with different numbers of $\lambda$s to judge
the effect of different numbers of parameters. Also, the convergence criteria for each model took
the number of parameters into account. Within each method, the training continued until on average
each parameter of the model had converged to eight decimal places.^9 Since some parameters were
expected to be very small, it was hoped that eight decimal places would catch at least several
significant digits.

The experiment was conducted as follows. First, raw frequencies ($F_j$ and $F_{i|j}$) and relative
frequencies ($f_j$ and $f_{i|j}$) of tokens and bigrams were obtained from the training data as a whole.
Next, the most probable values for the parameters of each model were solved for iteratively. For
MacKay's Dirichlet model, this meant solving the simultaneous equations given in section 2.3 to
obtain $\mathbf{u}$. For deleted estimation, separate frequencies ($f_i^k$ and $f_{i|j}^k$) were first calculated for each
block, and then the set of equations given in section 2.2 was solved to obtain $\{\lambda_h\}$. The most
probable parameter values for each model were then used to compute probabilities $q_{i|j}$ for each
bigram in the test data (i.e., the subset of $Q$ that would actually be used). The result was, in effect,
two completely instantiated, competing models of the word sequences in the Hansard. Finally, the
perplexity of each of the three test data samples was evaluated using each of the models, and the
results were compared.

^8 The sentences were sorted alphabetically and the first 6000 were taken.

^9 More precisely, on each iteration, the differences for each parameter were summed. If the total was less than
0.000000005 × the number of parameters, convergence was declared.

Sample      Deleted estimation                                             MacKay's
Sample 1    79.60                                                          79.90
Sample 2    89.57 (3 $\lambda$s), 88.47 (15 $\lambda$s), 88.91 (150 $\lambda$s)      89.06
Sample 3    91.82                                                          92.28

Table 1: Perplexities of the three test data samples under the different models.


The experiment was run on a SPARCserver 1000 with dual "supersparc" CPUs and 128Mb of 
memory, using a collection of csh shell scripts, C programs, and UNIX utilities. In this environment, 
it took only a few hours of CPU time.

The perplexity of each test sample under each model is given in Table 1. 

For all three samples, the perplexities under the deleted estimation model and under MacKay's 
Bayesian model are nearly the same. 

For Sample 2, three deleted estimation models having different numbers of $\lambda$s were tested. The
effect of altering the number of $\lambda$s was very small, and of particular interest is the fact that perplexity
did not steadily decrease as the number of $\lambda$s increased. This suggests that the large number of
parameters in MacKay's Bayesian method does not give it a significant advantage.

Finally, the perplexity results for Sample 3 are close to the corresponding results for Sample 2. 
This suggests that Sample 2 is large enough to provide a meaningful comparison of models. The fact 
that the perplexity results for Sample 1 are so much lower than those of Sample 2 probably reflects
the high degree of regularity of the extra (conventional) data more than the small increase in test 
data size. 

With regard to resource use, MacKay's algorithm has an advantage. The number of iterations 
required for each algorithm to converge was comparable. However, a single iteration of MacKay's 
method requires time linear in the size of the vocabulary, while an iteration of deleted estimation 
requires time linear in the size of the training corpus. The larger the training corpus, the more 
significant would be this difference. Also, deleted estimation requires more disk space because it 
keeps separate count and frequency data for each block of the training corpus.



4 Related Work 

There are a number of other smoothing methods for estimating probability distributions. In this 
chapter, I summarize some of these methods and the comparisons that have been made among 
them. If a method exists in many variations, I have tried to present a basic form applied to bigram 
probabilities. 

One well-known and simple way to handle the problem of zero relative frequencies for bigrams 
not found in the training corpus is the method of initial counts. Each bigram ji is assigned an initial 
count of 1 or 0.5 before the training data are examined. The initial counts are added to the observed 
counts of each bigram in the training data, and the adjusted relative frequency 

$$\frac{\text{initial} + F_{ji}}{\sum_{j'i'} \left(\text{initial} + F_{j'i'}\right)}$$

is used as the probability of bigram ji in the predictive model. All bigrams with the same observed 
count have the same probability, and bigrams with an observed count of zero have a small probability 
due to the initial count. However, Gale and Church (1994) have shown that the method of initial 
counts gives very poor estimates for bigrams that do occur in the training data. 
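
The method of initial counts can be sketched as below. This is a minimal illustration under the assumption that every possible bigram (vocabulary size squared) receives the same initial count; the function name and parameters are hypothetical and do not appear in the thesis programs.

```c
#include <assert.h>

/* Probability of a bigram under the method of initial counts.
   initial        - initial count given to every possible bigram (e.g. 1 or 0.5)
   f_ji           - observed count of this bigram in the training data
   total_observed - total number of bigram tokens in the training data
   num_possible   - number of distinct possible bigrams (assumed: vocabulary
                    size squared)
   All names are illustrative. */
double initial_count_prob(double initial, long f_ji,
                          long total_observed, long num_possible) {
    assert(initial > 0.0 && f_ji >= 0 && num_possible > 0);
    return (initial + (double)f_ji) /
           (initial * (double)num_possible + (double)total_observed);
}
```

With a two-word vocabulary (four possible bigrams), an initial count of 1, and eight observed bigram tokens, an unseen bigram receives probability 1/12, and the four probabilities still sum to one.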

The Good-Turing method (Good, 1953; Katz, 1987; Church and Gale, 1991) involves not only 
the observed counts of each bigram in the training corpus but also the number of bigrams that share 
each count. If $N$ is the total number of bigrams in the corpus, and $N_r$ is the number of bigrams that 
have observed count $r$, then the Good-Turing estimate of the probability of a bigram with observed 
count $r$ is 

$$\frac{(r+1)\,N_{r+1}}{N_r} \cdot \frac{1}{N} \qquad (65)$$

Notice that all bigrams with the same count have the same probability, just as with the maximum 
likelihood estimator or the initial counts method. The probability of each bigram that doesn't appear 
in the training corpus is 

$$\frac{N_1}{N_0\,N}$$

and the probability of all unobserved bigrams adds up to $N_1/N$. In practice, it is usually necessary 
to smooth the $N_r$ before using them in equation 65. Gale (1994) presents a simple way to smooth 
these $N_r$s. 
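
Equation 65 can be sketched in code as follows. This is an unsmoothed illustration only: it assumes the counts-of-counts $N_r$ are supplied in an array and that both $N_r$ and $N_{r+1}$ are available, which is exactly the situation that smoothing the $N_r$ is meant to guarantee. The function name and interface are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Good-Turing probability of a single bigram with observed count r.
   n_r[k] holds N_k, the number of distinct bigrams observed exactly k times,
   for k = 0 .. max_r; total is N, the number of bigram tokens in the corpus.
   No smoothing of the N_r is attempted here. */
double good_turing_prob(const long *n_r, size_t max_r, size_t r, long total) {
    assert(r + 1 <= max_r);            /* N_{r+1} must be available */
    assert(n_r[r] > 0 && total > 0);
    return ((double)(r + 1) * (double)n_r[r + 1]) /
           ((double)n_r[r] * (double)total);
}
```

Multiplying the result for r = 0 by the number of unseen bigrams gives the total probability reserved for unobserved bigrams.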

Nadas (1984) introduces parametric empirical Bayes estimation and compares it with deleted 
estimation and Good-Turing estimation using perplexity as the measure. Parametric empirical Bayes 
estimation obtains the probabilities of the bigrams as a result of Bayesian inference. A convenient 
prior distribution with some unknown parameters is arbitrarily chosen, the posterior probabilities 
of the bigrams are expressed in terms of the prior, and the parameters of the prior are calculated 
from the relative frequencies of the bigrams in the training data. The method resembles MacKay's, 
except that it uses a different prior. Nadas's perplexity figures show that his method is very slightly 
worse than both Good-Turing and deleted estimation. 

Katz's (1987) "back off" method uses a fraction of the corresponding (n-1)-gram probability 
as the probability of an unseen n-gram. When applied to bigram models, his method yields the 
following formula for the conditional probability of a bigram that occurs $r$ times: 

$$
\Pr(i \mid j) =
\begin{cases}
r / F_j & \text{if } r > 5 \text{ (relative frequency)} \\
\text{(Good-Turing estimate)} & \text{if } 0 < r \le 5 \\
\gamma F_i & \text{if } r = 0
\end{cases}
\qquad (67)
$$

where $F_i$ and $F_j$ are the counts in the training corpus of words i and j respectively, and $\gamma$ is a 
normalizing constant. Katz claims that his method is time and space efficient, and his perplexity 
results show that it performs very slightly better than deleted estimation and parametric empirical 
Bayes. 

Church and Gale (1991) compare their "enhanced Good-Turing" and "enhanced deleted 
estimation" methods to several other standard estimators in a thorough, carefully designed 
experiment. They first separate the bigrams of the training data into groups depending on the 
"unigram estimator", which is the product of the unigram relative frequencies of the two constituent 
words, and then apply another estimation method within each group. Applying Good-Turing within 
each group (interpreting $N_r$ as the number of bigrams in this group that have an observed count of 
$r$ in the training data) yields the "enhanced Good-Turing" estimates. For the "enhanced deleted 
estimates", the training data are split into two halves labelled 0 and 1. Then for each group of 
bigrams above, the following procedure is carried out. Within each half of the training data, count 
the number of distinct bigrams that occur $r$ times ($N_r^0$ and $N_r^1$) and the total number of occurrences 
of those bigrams in the other half ($C_r^{01}$ and $C_r^{10}$). Then replace the actual observed count of the 
bigram by 

$$\frac{C_r^{01} + C_r^{10}}{N_r^0 + N_r^1}$$

in calculating the maximum likelihood estimator. For example, in a training corpus of $N$ bigrams, 
suppose bigram aa occurs 300 times in half 0 and 100 times in half 1, bigram bb occurs 300 times in 
half 0 and 400 times in half 1, bigram cc occurs 200 times in half 0 and 300 times in half 1, and no 
other bigrams occur exactly 300 times in either half. Then $N_{300}^0$ would be 2, $N_{300}^1$ would be 1, 
$C_{300}^{01}$ would be 500, $C_{300}^{10}$ would be 200, and the probability of aa, which occurs 400 times in the 
training data, would be 

$$\frac{500 + 200}{2 + 1} \cdot \frac{1}{N} \approx \frac{233}{N} \qquad (69)$$
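
The counting step in this example can be sketched as follows. The sketch handles a single group of bigrams (the grouping by unigram estimator is left out), and the function name and array layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Adjusted count for bigrams observed r times in one half of the data.
   half0[k] and half1[k] are the counts of bigram k in halves 0 and 1.
   Returns (C_r^01 + C_r^10) / (N_r^0 + N_r^1).
   Hypothetical helper; a single group of bigrams is assumed. */
double deleted_adjusted_count(const long *half0, const long *half1,
                              size_t n_bigrams, long r) {
    long n0 = 0, n1 = 0;    /* bigrams with count r in half 0 / half 1 */
    long c01 = 0, c10 = 0;  /* their occurrences in the other half     */
    size_t k;

    for (k = 0; k < n_bigrams; k++) {
        if (half0[k] == r) { n0++; c01 += half1[k]; }
        if (half1[k] == r) { n1++; c10 += half0[k]; }
    }
    assert(n0 + n1 > 0);    /* some bigram must have count r */
    return (double)(c01 + c10) / (double)(n0 + n1);
}
```

Called with the counts of aa, bb and cc from the example and r = 300, the function returns 700/3, which is the adjusted count that equation 69 divides by $N$.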



The enhanced Good-Turing estimates and the enhanced deleted estimates are compared with the 
unigram estimator, the maximum likelihood estimator, and Good-Turing using variances and 
t-scores. The results show that enhanced Good-Turing and enhanced deleted estimation are both 
better than any of the other methods, but enhanced Good-Turing is more efficient in its use of data. 
Church and Gale also discuss the need for careful sampling in building the training corpus, and 
present sophisticated measures for comparing methods. 

5 Conclusion 

I have shown that MacKay's Bayesian model performs as well as deleted estimation in tests on token 
bigrams, and requires fewer resources. Therefore it is reasonable to recommend that MacKay's 
Bayesian method replace deleted estimation in future work with token bigrams. 

However, most statistical natural language processing systems in the real world use trigrams, not 
bigrams. Also, Good-Turing estimation seems to be the most accurate of the smoothing methods 
currently in use. It would be interesting to see whether a Bayesian approach with a different prior 
model could outperform Good-Turing estimation on trigrams. 

A Programs 

This appendix contains all the programs used to carry out the experiment. "Programs" includes C 
programs, csh scripts, awk scripts, and sed scripts. The topmost level is given in the form of three 
diagrams, showing the order and purpose of the programs. Program names are given in bold type. 
The program listings follow the diagrams, arranged in alphabetical order by name. 

Text Preparation 

    aligned Hansard (see Appendix B) 
        | 
    preprocessing: 
        graffilt <corpus >t1 
        sed -f sed.script <t1 >t2 
        | 
    divide into blocks: 
        awk -f spray.awk <t2 
        | 
    test data: block[7-9] 
        | 
    extract covered sentences: 
        csh get_test_data.csh block[7-9] 
        cat *.test >Sample1 
        | 
    Sample 1 
        | 
    remove duplicates: 
        sort Sample1 | uniq >t3 
        sort block[1-6] t3 | uniq -d >dupl 
        grep -f dupl -v -x <t3 >Sample2 
        | 
    Sample 2 
        | 
    split in half: 
        ed Sample2 
        6000,12000w Sample3 
        q 
        | 
    Sample 3 

Constructing the Models 



    training data (block[1-6]) 
        | 
    counts for each block: 
        csh count.csh "block[1-6]" 
        | 
    block[1-6].tok.counts, block[1-6].bigr.counts 
        | 
        +-- deleted estimation: 
        |       csh delint.csh corpus "block[1-6]" tolerance 
        |       --> corpus.lambdas 
        | 
        +-- merge counts: 
                csh merge.csh corpus "block[1-6]" 
                --> corpus.tok.counts, corpus.bigr.counts 
                    | 
                    +-- MacKay's method 
                    |   (initial guesses: corpus.alpha, corpus.alpha.max.min): 
                    |       csh mackay.csh corpus tolerance 
                    |       --> corpus.alpha, corpus.u_i 
                    | 
                    +-- frequencies: 
                            csh freqs.csh corpus 
                            --> corpus.tok.freq, corpus.bigr.freq 

Testing the Models 




    perplexity under each model: 
        csh perplext.csh corpus sample 
        | 
        +--> perplexity under del. est.: sample.delint.answer 
        | 
        +--> perplexity under MacKay:    sample.mackay.answer 

alpha.c 



/* alpha.c */ 
/* 
   Increases or decreases alpha depending on sum_u_i. 

   Input:  $corpus.alpha $corpus.sum_u_i $corpus.alpha.max.min 
   Output: alpha on stdout; $corpus.alpha.max.min may be updated. 
*/ 

#include <stdio.h> 
#include <stdlib.h> 
#include <math.h> 
#define MAXWORDLEN 200 

int main(int argc, char** argv) { 
    FILE* alpha_fp; 
    FILE* sum_fp; 
    FILE* alpha_max_min_fp; 

    double alpha, sum_u_i, alpha_max, alpha_min; 

    if (argc != 4) { 
        fprintf(stderr, 
                "usage: %s alpha sum_u_i alpha.max.min\n", 
                argv[0]); 
        exit(2); 
    } 

    /* get alpha */ 
    if (!(alpha_fp = fopen(argv[1], "r"))) { 
        fprintf(stderr, "%s: error opening %s\n", argv[0], argv[1]); 
        exit(2); 
    } 
    fscanf(alpha_fp, "%lg", &alpha); 
    fclose(alpha_fp); 

    /* get sum_u_i */ 
    if (!(sum_fp = fopen(argv[2], "r"))) { 
        fprintf(stderr, "%s: error opening %s\n", argv[0], argv[2]); 
        exit(2); 
    } 
    fscanf(sum_fp, "%lg", &sum_u_i); 
    fclose(sum_fp); 

    /* get alpha_max & alpha_min */ 
    if (!(alpha_max_min_fp = fopen(argv[3], "r"))) { 
        fprintf(stderr, "%s: error opening %s\n", argv[0], argv[3]); 
        exit(2); 
    } 
    fscanf(alpha_max_min_fp, "%lg %lg", &alpha_max, &alpha_min); 
    fclose(alpha_max_min_fp); 

    /* Calculate new alpha */ 
    if (sum_u_i < alpha) { 
        alpha_max = alpha; 
        if (alpha_min == -1) /* alpha_min not established yet */ 
            alpha = alpha / 2; 
        else 
            alpha = sqrt(alpha_min * alpha_max); 
    } 
    else if (sum_u_i > alpha) { 
        alpha_min = alpha; 
        if (alpha_max == -1) /* alpha_max not established yet */ 
            alpha = alpha * 2; 
        else 
            alpha = sqrt(alpha_min * alpha_max); 
    } 

    printf("%.15g\n", alpha); 

    /* write alpha_max & alpha_min */ 
    if (!(alpha_max_min_fp = fopen(argv[3], "w"))) { 
        fprintf(stderr, "%s: error opening %s for write\n", argv[0], argv[3]); 
        exit(2); 
    } 
    fprintf(alpha_max_min_fp, "%.15g %.15g\n", alpha_max, alpha_min); 
    fclose(alpha_max_min_fp); 

    return 0; 
} 



bigrams.c 

/* bigrams.c */ 
/* 
   Given a sentence of tokens separated by whitespace on stdin, 
   extracts all token bigrams and writes them one per line on stdout. 
*/ 

#define MAXWORDLENGTH 80 

#include <stdio.h> 
#include <ctype.h> 

int main(int argc, char **argv) { 
    register int c, state, i, j; 
    char buffer[MAXWORDLENGTH]; 
    state = 0; 

    while ((c = getchar()) != EOF) { 
        switch (state) { 

        case 0: /* reading whitespace before first token of sentence */ 
            if (!isspace(c)) { /* start of first token found */ 
                state = 1; 
                i = 0; 
                buffer[i++] = c; 
            } 
            break; 

        case 1: /* reading and saving first token */ 
            if (c == '\n') state = 0; 
            else if (isspace(c)) state = 2; /* end of token found */ 
            else buffer[i++] = c; 
            break; 

        case 2: /* eating white space between tokens */ 
            if (c == '\n') state = 0; 
            else if (!isspace(c)) { /* next token found */ 
                state = 3; 
                /* output stored token in buffer[0..i-1] */ 
                for (j = 0; j < i; j++) putchar(buffer[j]); 
                putchar(' '); 
                i = 0; 
                /* save and print first char of found token */ 
                buffer[i++] = c; 
                putchar(c); 
            } 
            break; 

        case 3: /* saving and printing a token */ 
            if (c == '\n') { /* end of sentence */ 
                state = 0; 
                putchar('\n'); 
            } 
            else if (isspace(c)) { /* end of token */ 
                state = 2; 
                putchar('\n'); 
            } 
            else { 
                buffer[i++] = c; 
                putchar(c); 
            } 
            break; 
        } 
    } 

    if (state == 3) putchar('\n'); 
    return 0; 
} 



bigrfreq.c 

/* bigrfreq.c */ 
/* 
   Calculates the conditional relative frequency of i|j, given 
   the total number of bigrams beginning with j and the count of 
   bigram ji, and prints it on stdout as follows: freq j i. 
*/ 

#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#define MAXWORDLEN 200 

int main(int argc, char **argv) { 
    FILE *tokcountfile;  /* total bigrams starting with j */ 
    FILE *bigrcountfile; /* count j i */ 

    double tokcount, bigrcount; 
    char j[MAXWORDLEN], rest[MAXWORDLEN], token[MAXWORDLEN] = ""; 
    int result; 

    if (argc != 3) { 
        fprintf(stderr, 
                "Usage: bigrfreq corpus.bigr.tots corpus.bigr.counts\n"); 
        return 1; 
    } 

    tokcountfile = fopen(argv[1], "r"); 
    bigrcountfile = fopen(argv[2], "r"); 
    if (tokcountfile == NULL || bigrcountfile == NULL) { 
        fprintf(stderr, "bigrfreq: can't open input files\n"); 
        return 1; 
    } 

    while ((result = fscanf(bigrcountfile, "%lg %s %s", &bigrcount, j, rest)) 
           && result != EOF) { 
        /* advance the token counts until the entry for j is reached */ 
        while (strcmp(token, j)) { 
            if (fscanf(tokcountfile, "%lg %s", &tokcount, token) == EOF) { 
                fprintf(stderr, "error in %s\n", argv[2]); 
                exit(2); 
            } 
        } 
        printf("%.15g %s %s\n", bigrcount / tokcount, j, rest); 
    } 

    fclose(bigrcountfile); 
    fclose(tokcountfile); 
    return 0; 
} 



count.csh 

#count.csh 

#Counts tokens and bigrams in a set of text files. 
#Input parameter is file name pattern. 

#Input files are those files indicated by file name pattern. 
#Output files are $file.tok.counts, $file.bigr.counts for each 
#input file. 

set files=$argv[1] 

#Count tokens and bigrams for each input file. 
echo Started count.csh 
foreach file ($files) 
    tokens <$file | sort | uniq -c >$file.tok.counts 
    if ($status) then 
        echo tokens failed on $file ; suspend 
    endif 

    bigrams <$file | sort | uniq -c >$file.bigr.counts 
    if ($status) then 
        echo bigrams failed on $file ; suspend 
    endif 
end 



delint.csh 

#delint.csh 

#Calculation of probabilities by deleted estimation method. 
#Parameters are corpus name, file (block) name pattern, tolerance. 
#Input files are $file.tok.counts, $file.bigr.counts. 
#Output file is $corpus.lambdas. 

#count.csh must have already been run, and $corpus.lambdas must 
#have been initialized. 

set corpus=$argv[1] 
set files=$argv[2] 
set tolerance=$argv[3] 

set nlambdas=3 
@ iterations=100 

echo Starting delint.csh 

#Make a set of relative frequency files, each omitting one file from 
#the counts. 
foreach file ($files) 
    echo Block $file 

    # Create corpus count files, each omitting one file. 
    touch $corpus-$file.tok.counts 
    touch $corpus-$file.bigr.counts 
    foreach otherfile ($files) 
        # Merge all count files but $file. 
        if ($otherfile != $file) then 
            mergehist $corpus-$file.tok.counts $otherfile.tok.counts >$corpus.temp 
            if ($status) then 
                echo mergehist failed on $otherfile.tok.counts ; suspend 
            endif 
            mv $corpus.temp $corpus-$file.tok.counts 

            mergehist $corpus-$file.bigr.counts $otherfile.bigr.counts >$corpus.temp 
            if ($status) then 
                echo mergehist failed on $otherfile.bigr.counts ; suspend 
            endif 
            mv $corpus.temp $corpus-$file.bigr.counts 
        endif 
    end 

    # Calculate number of tokens in corpus - file 
    awk '$2!="<s>"{sum+=$1}END{print sum}' <$corpus-$file.tok.counts \ 
        >$corpus-$file.tok.tot 

    # Calculate relative frequencies ( - file) of tokens and bigrams 
    tokfreq $corpus-$file.tok.tot $corpus-$file.tok.counts >$corpus-$file.tok.freq 
    if ($status) then 
        echo tokfreq failed ; suspend 
    endif 

    bigrfreq $corpus-$file.tok.counts $corpus-$file.bigr.counts \ 
        >$corpus-$file.bigr.freq 
    if ($status) then 
        echo bigrfreq failed ; suspend 
    endif 

    # Make intermediate file 
    combine $corpus-$file.bigr.freq $file.bigr.counts |\ 
        awk '$2!=0{print $1,$3,$4,$2}' >$corpus-$file.f_ji.j.i.F_ji 
    sort +1 $corpus-$file.tok.freq $corpus-$file.f_ji.j.i.F_ji | \ 
        awk 'NF==2{f_j=$1}NF==4{print $2,$3,$4,$1,f_j}' \ 
        >$corpus-$file.j.i.F_ji.f_ji.f_j 
    sort +1 $corpus-$file.j.i.F_ji.f_ji.f_j $corpus-$file.tok.freq |\ 
        awk 'NF==2{f_i=$1}NF==5{print $1,$2,$3,$4,$5,f_i}' \ 
        >$corpus-$file.j.i.F_ji.f_ji.f_j.f_i 

    # Delete garbage before proceeding 
    rm $corpus-$file.f_ji.j.i.F_ji $corpus-$file.j.i.F_ji.f_ji.f_j 
    rm $corpus-$file.bigr.freq $corpus-$file.tok.freq 
    rm $corpus-$file.tok.tot $corpus-$file.bigr.counts $corpus-$file.tok.counts 
end 

#Search for lambda_h's that maximize the probability of the 
#training corpus. 

#Determine where to look for a zero of l' wrt lambda_h, by 
#evaluating the log likelihood (l) and its derivative wrt lambda_h 
#(l') for several values of each lambda_h. These are 
#big sums with a contribution from each $file (block). 
#My approach is to calculate for each $file and each h, the term 
#that file contributes to l and l' given a certain value for 
#lambda_h. Then when all files have been processed, the terms 
#are added. This reduces the amount of file access required for 
#the calculations. 

#Create new accumulator files 
touch $corpus.l_0 
touch $corpus.l_1 
touch $corpus.lp_0 
touch $corpus.lp_.25 
touch $corpus.lp_.5 
touch $corpus.lp_.75 
touch $corpus.lp_1 

foreach file ($files) 
    terms1 $nlambdas $corpus-$file.j.i.F_ji.f_ji.f_j.f_i $corpus-$file.terms.l_0 \ 
        $corpus-$file.terms.l_1 $corpus-$file.terms.lp_0 \ 
        $corpus-$file.terms.lp_.25 $corpus-$file.terms.lp_.5 \ 
        $corpus-$file.terms.lp_.75 $corpus-$file.terms.lp_1 
    if ($status) then 
        echo terms1 failed on $file ; suspend 
    endif 

    mergehist $corpus.l_0 $corpus-$file.terms.l_0 > $corpus.temp 
    if ($status) then 
        echo mergehist failed on $corpus-$file.terms.l_0 ; suspend 
    endif 
    mv $corpus.temp $corpus.l_0 

    mergehist $corpus.l_1 $corpus-$file.terms.l_1 > $corpus.temp 
    if ($status) then 
        echo mergehist failed on $corpus-$file.terms.l_1 ; suspend 
    endif 
    mv $corpus.temp $corpus.l_1 

    mergehist $corpus.lp_0 $corpus-$file.terms.lp_0 > $corpus.temp 
    if ($status) then 
        echo mergehist failed on $corpus-$file.terms.lp_0 ; suspend 
    endif 
    mv $corpus.temp $corpus.lp_0 

    mergehist $corpus.lp_.25 $corpus-$file.terms.lp_.25 > $corpus.temp 
    if ($status) then 
        echo mergehist failed on $corpus-$file.terms.lp_.25 ; suspend 
    endif 
    mv $corpus.temp $corpus.lp_.25 

    mergehist $corpus.lp_.5 $corpus-$file.terms.lp_.5 > $corpus.temp 
    if ($status) then 
        echo mergehist failed on $corpus-$file.terms.lp_.5 ; suspend 
    endif 
    mv $corpus.temp $corpus.lp_.5 

    mergehist $corpus.lp_.75 $corpus-$file.terms.lp_.75 > $corpus.temp 
    if ($status) then 
        echo mergehist failed on $corpus-$file.terms.lp_.75 ; suspend 
    endif 
    mv $corpus.temp $corpus.lp_.75 

    mergehist $corpus.lp_1 $corpus-$file.terms.lp_1 > $corpus.temp 
    if ($status) then 
        echo mergehist failed on $corpus-$file.terms.lp_1 ; suspend 
    endif 
    mv $corpus.temp $corpus.lp_1 
    rm $corpus-$file.terms.* 
end 

initlamb $corpus.l_0 $corpus.l_1 $corpus.lp_0 $corpus.lp_.25 $corpus.lp_.5 \ 
    $corpus.lp_.75 $corpus.lp_1 >$corpus.lambdas.work 
if ($status) then 
    echo initlamb failed ; suspend 
endif 

rm $corpus.lp_* $corpus.l_* 

#Calculate optimum lambda_h's by binary search. 

#Each of the lambda_h's is actually solved for independently, 
#but to reduce file access the processes are interleaved. 

@ iterid=1 

echo Calculating lambda_hs 

#For each iteration of binary search 
startloop: 

echo Iteration $iterid 

#The partial derivative l' wrt lambda_h is a big sum with a 
#contribution from each $file (block). My approach is to 
#calculate for each $file, the term it contributes to l' wrt 
#lambda_h for each lambda_h, and accumulate them in sum 
#accumulation file $corpus.lp. Then when all $file's have been 
#processed, I calculate the new lambdas. This reduces the amount 
#of file access required for the calculations. 

#Create new sum accumulation files 
touch $corpus.lp 

#Accumulate sums. 
foreach file ($files) 
    terms2 $corpus.lambdas.work $corpus-$file.j.i.F_ji.f_ji.f_j.f_i \ 
        >$corpus-$file.terms.lp 
    if ($status) then 
        echo terms failed on $file ; suspend 
    endif 

    mergehist $corpus.lp $corpus-$file.terms.lp > $corpus.temp 
    if ($status) then 
        echo mergehist failed on $corpus-$file.terms.lp ; suspend 
    endif 
    mv $corpus.temp $corpus.lp 
    rm $corpus-$file.terms.lp 
end 

#Calculate new lambda_h's. 
mv $corpus.lambdas.work $corpus.lambdas.$iterid 
lambda $corpus.lambdas.$iterid $corpus.lp > $corpus.lambdas.work 
if ($status) then 
    echo bracket failed ; suspend 
endif 

#rm $corpus.lp 

#Check if convergence reached 
awk '{print $1}' <$corpus.lambdas.$iterid >$corpus.temp1 
awk '{print $1}' <$corpus.lambdas.work >$corpus.temp2 
diffrnc $corpus.temp1 $corpus.temp2 $tolerance >>$corpus.diffrnc 

switch ($status) 
case 0: 
    breaksw 
case 1: 
    if ($iterid > $iterations) then 
        echo Too many iterations ; suspend 
    endif 
    @ iterid++ ; goto startloop 
    breaksw 
case 2: 
    echo diffrnc failed ; suspend 
endsw 

echo Convergence reached 
mv $corpus.temp2 $corpus.lambdas 

#Cleanup 
rm $corpus-*.j.i.f_ji.f_j.f_i 
rm $corpus.diffrnc $corpus.temp1 $corpus.temp2 $corpus-* 



diffrnc.c 

/* diffrnc.c */ 
/* 
   Calculates the sum of the differences between corresponding 
   records of two parallel files of numbers, and checks the total 
   difference against the tolerance. 

   Used to compare old and new alpha's and m_i's in MacKay's method, 
   and to compare old and new lambda's in deleted interpolation method. 
   Input is oldfile newfile tolerance. 
   (oldfile and newfile must have the same number of records.) 
   Output is the total difference on stdout. 
   Return code is 0 if total difference <= tolerance, 1 if total 
   difference > tolerance, or 2 for error. 
*/ 

#include <stdio.h> 
#include <math.h> 
#include <stdlib.h> 

int main(int argc, char **argv) { 
    FILE *oldfile; 
    FILE *newfile; 
    double old, new, difference = 0; 

    if (argc != 4) { 
        fprintf(stderr, "Usage: diffrnc oldfile newfile tolerance\n"); 
        exit(2); 
    } 

    if ((oldfile = fopen(argv[1], "r")) == NULL) { 
        fprintf(stderr, "diffrnc: can't open %s\n", argv[1]); 
        exit(2); 
    } 

    if ((newfile = fopen(argv[2], "r")) == NULL) { 
        fprintf(stderr, "diffrnc: can't open %s\n", argv[2]); 
        exit(2); 
    } 

    while (fscanf(oldfile, "%lg", &old) != EOF) { 
        if (fscanf(newfile, "%lg", &new) == EOF) { 
            fprintf(stderr, "diffrnc: files are different lengths\n"); 
            exit(2); 
        } 
        difference += fabs(new - old); 
    } 

    if (fscanf(newfile, "%lg", &new) != EOF) { 
        fprintf(stderr, "diffrnc: files are different lengths\n"); 
        exit(2); 
    } 

    printf("%.15g\n", difference); 

    if (difference > atof(argv[3])) return 1; 
    else return 0; 
} 



dprobs.c 

/* dprobs.c */ 
/* 
   Given the lambdas, unigram frequencies and bigram frequencies, 
   it calculates the lambda mixture probabilities for the bigrams 
   in the test data. 

   Input is corpus.lambdas, corpus.j.i.f_ji.f_j.f_i (unigram and 
   bigram frequencies, sorted by j then i, for only those bigrams found 
   in the test data). 
   Output is prob (a.k.a. q_ji), j, i, sorted by j then i, on stdout. 
*/ 

#include <stdio.h> 
#include <stdlib.h> 
#include <math.h> 
#define MAXWORDLEN 200 
#define MAXLAMBDAS 200 

int main(int argc, char* argv[]) { 
    FILE *freqs, *lambdas; 
    char i[MAXWORDLEN], j[MAXWORDLEN]; 
    char temp_char[2]; 
    int r, h; 
    double f_ji, f_i, f_j, prob, lambda[MAXLAMBDAS+1], temp_double; 
    double interval; 

    /* check parameters */ 
    if (argc != 3) { 
        fprintf(stderr, 
                "Usage: %s corpus.lambdas corpus.j.i.f_ji.f_j.f_i\n", 
                argv[0]); 
        exit(2); 
    } 

    /* load lambdas and calculate interval size */ 
    /* For simplicity, h runs from 1 to r rather than 0 to r-1. */ 
    if ((lambdas = fopen(argv[1], "r")) == NULL) { 
        fprintf(stderr, "dprobs: can't open %s\n", argv[1]); 
        exit(2); 
    } 

    r = 1; 
    while (r <= MAXLAMBDAS && 
           fscanf(lambdas, "%lg %1s", &temp_double, temp_char) != EOF) { 
        lambda[r] = temp_double; 
        r++; 
    } 
    if (r > MAXLAMBDAS && 
        fscanf(lambdas, "%lg %1s", &temp_double, temp_char) != EOF) { 
        fprintf(stderr, 
                "dprobs: Too many lambdas. Change MAXLAMBDAS and recompile.\n"); 
        exit(2); 
    } 
    r--; /* r now represents the number of lambdas found in the file */ 
    fclose(lambdas); 

    interval = 0.03 / r; /* size of interval for lambdas 1 through r-1 */ 

    /* Open files */ 
    if ((freqs = fopen(argv[2], "r")) == NULL) { 
        fprintf(stderr, "dprobs: can't open %s\n", argv[2]); 
        exit(2); 
    } 

    /* For each bigram in corpus.j.i.f_ji.f_j.f_i (i.e. each bigram in 
       the test data), calculate the probability 
    */ 
    while (fscanf(freqs, "%s %s %lg %lg %lg", 
                  j, i, &f_ji, &f_j, &f_i) != EOF) { 
        h = floor(f_j / interval + 1); 
        if (h > r) h = r; 
        prob = lambda[h] * f_i + (1 - lambda[h]) * f_ji; 
        printf("%.15g %s %s\n", prob, j, i); 
    } 

    fclose(freqs); 
    return 0; 
} 
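
Read as a formula, the loop at the end of dprobs.c appears to compute the deleted estimation mixture with a λ chosen by the relative frequency of the context word j. In the restatement below, $r$ is the number of λs read from the file and $f_{i \mid j}$ is the conditional relative frequency stored in the variable `f_ji`; the bucket width 0.03 / r is copied from the listing, and its justification is not given there.

```latex
\Pr(i \mid j) = \lambda_h f_i + (1 - \lambda_h)\, f_{i \mid j},
\qquad
h = \min\!\left( r,\ \left\lfloor \frac{f_j}{0.03 / r} \right\rfloor + 1 \right)
```

Contexts with relative frequency above 0.03 therefore all share the last parameter, $\lambda_r$.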



freqs.csh 

#freqs.csh 

#Calculates relative frequencies of bigrams in a corpus. 
#Parameter is corpus name. 

#Input files are $corpus.tok.counts, $corpus.bigr.counts. 
#Output files are $corpus.bigr.freq, $corpus.tok.freq. 

set corpus=$argv[1] 

echo Starting freqs.csh 

awk '$2!="<s>"{sum+=$1}END{print sum}' <$corpus.tok.counts >$corpus.tok.tot 
tokfreq $corpus.tok.tot $corpus.tok.counts >$corpus.tok.freq 
if ($status) then 
    echo tokfreq failed ; suspend 
endif 

bigrfreq $corpus.tok.counts $corpus.bigr.counts >$corpus.bigr.freq 
if ($status) then 
    echo bigrfreq failed ; suspend 
endif 

rm $corpus.tok.tot 



G_i.c 

/* G_i.c */ 
/* 
   Calculates G_i, H_i and V_i from N_fi.i.f. 

   Input:  N_fi i f sorted in reverse order by i then f on stdin 
           (N_fi is the number of j's for which F_ji >= f.) 
   Output: G_i H_i V_i i sorted in reverse order by i on stdout. 
*/ 

#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include <math.h> 
#define MAXWORDLEN 200 

int main(int argc, char** argv) { 
    long int f, last_f, f_max_i, last_N_fi, N_fi; 
    double G_i, H_i; 
    char i[MAXWORDLEN], save_i[MAXWORDLEN] = ""; 
    int result; 

    if (argc != 1) { 
        fprintf(stderr, "usage: %s \n", argv[0]); 
        exit(2); 
    } 

    f_max_i = 0; last_N_fi = 0; last_f = 0; G_i = 0; H_i = 0; 
    while (1) { 
        result = scanf("%ld %s %ld", &N_fi, i, &f); 
        /* if i changes or end of file */ 
        if (result == EOF || strcmp(save_i, i)) { 
            /* catch up last i */ 
            while (last_f > 2) { 
                last_f--; 
                G_i += last_N_fi / (double)(last_f - 1); 
                H_i += last_N_fi / (double)pow(last_f - 1, 2); 
            } 
            if (last_f > 1) { 
                printf("%.15g %.15g %ld %s\n", G_i, H_i, last_N_fi, save_i); 
                last_f = 1; 
            } 
            if (result == EOF) break; 
            /* start new i */ 
            G_i = 0; H_i = 0; 
            last_N_fi = N_fi; 
            last_f = f_max_i = f; 
            strcpy(save_i, i); 
            if (f > 1) { 
                G_i += N_fi / (double)(f - 1); 
                H_i += N_fi / (double)pow(f - 1, 2); 
            } 
            else 
                printf("%.15g %.15g %ld %s\n", G_i, H_i, N_fi, i); 
        } 
        /* if i stays the same */ 
        else { 
            /* catch up f */ 
            while (last_f > f + 1) { 
                last_f--; 
                G_i += last_N_fi / (double)(last_f - 1); 
                H_i += last_N_fi / (double)pow(last_f - 1, 2); 
            } 
            /* do this line */ 
            last_N_fi = N_fi; 
            last_f = f; 
            if (f > 1) { 
                G_i += N_fi / (double)(f - 1); 
                H_i += N_fi / (double)pow(f - 1, 2); 
            } 
            else 
                printf("%.15g %.15g %ld %s\n", G_i, H_i, N_fi, i); 
        } 
    } 
    return 0; 
} 



get_test_data.csh 

set corpus=big 
set original=$argv[1] 

tokens < $original | sort | uniq -c > $original.tok.counts 
combine $corpus.tok.counts $original.tok.counts |\ 
    awk '$1==0{print $3}' > $corpus.$original.vocab.diff 
fgrep -v -f $corpus.$original.vocab.diff $original > $original.te 

graffilt.c 

/* 
   graffilt.c 

   Removes embedded newlines from sentences in aligned hansards 
   obtained from Dave Graff. stdin to stdout 
*/ 

#include <stdio.h> 
#include <stdlib.h> 

int main() { 
    /* int rather than char, so that EOF can be distinguished from data */ 
    register int char1, char2; 

    if ((char1 = getchar()) == EOF) { 
        fprintf(stderr, "graffilt: no input\n"); 
        exit(2); 
    } 

    while ((char2 = getchar()) != EOF) { 
        /* drop a newline only when it is followed by a space */ 
        if (char1 != '\n' || char2 != ' ') putchar(char1); 
        char1 = char2; 
    } 

    putchar(char1); 
    return 0; 
} 



initlamb.c 

/* initlamb.c */ 
/* 
   Determines whether and where to search for a zero of l'. 
   If there is a zero of l' in the interval [0,1], it produces an 
   initial guess. If there is no zero 
   in the interval, it determines which endpoint maximizes the log 
   likelihood. 

   Input files: $corpus.l_0 $corpus.l_1 $corpus.lp_0 $corpus.lp_.25 
                $corpus.lp_.5 $corpus.lp_.75 $corpus.lp_1 
   Output is lambda_h final? (y or n) on stdout. 
*/ 

#include <stdio.h> 
#include <stdlib.h> 

int main(int argc, char **argv) { 
    FILE *l_0, *l_1, *lp_0, *lp_25, *lp_5, *lp_75, *lp_1; 
    double l_0_h, l_1_h, lp_0_h, lp_25_h, lp_5_h, lp_75_h, lp_1_h; 
    int h; 

    /* Initialization */ 
    if (argc != 8) { 
        fprintf(stderr, "initlamb: Wrong number of parameters\n"); 
        exit(2); 
    } 
    if ((l_0 = fopen(argv[1], "r")) == NULL) { 
        fprintf(stderr, "initlamb: can't open %s\n", argv[1]); 
        exit(2); 
    } 
    if ((l_1 = fopen(argv[2], "r")) == NULL) { 
        fprintf(stderr, "initlamb: can't open %s\n", argv[2]); 
        exit(2); 
    } 
    if ((lp_0 = fopen(argv[3], "r")) == NULL) { 
        fprintf(stderr, "initlamb: can't open %s\n", argv[3]); 
        exit(2); 
    } 
    if ((lp_25 = fopen(argv[4], "r")) == NULL) { 
        fprintf(stderr, "initlamb: can't open %s\n", argv[4]); 
        exit(2); 
    } 
    if ((lp_5 = fopen(argv[5], "r")) == NULL) { 
        fprintf(stderr, "initlamb: can't open %s\n", argv[5]); 
        exit(2); 
    } 
    if ((lp_75 = fopen(argv[6], "r")) == NULL) { 
        fprintf(stderr, "initlamb: can't open %s\n", argv[6]); 
        exit(2); 
    } 
    if ((lp_1 = fopen(argv[7], "r")) == NULL) { 
        fprintf(stderr, "initlamb: can't open %s\n", argv[7]); 
        exit(2); 
    } 

    /* For each lambda_h, determine whether l' wrt lambda_h has a 
       zero in [0,1]. If yes, determine which subinterval it falls 
       in and initialize lambda_h. If no, 
       determine which endpoint maximizes this lambda_h's 
       contribution to the log likelihood. 
    */ 

    /* Read the relevant figures for a lambda_h */ 
    while (fscanf(l_0, "%lg %d", &l_0_h, &h) != EOF) { 
        if (fscanf(l_1, "%lg %d", &l_1_h, &h) == EOF) { 
            fprintf(stderr, "initlamb: error in file %s\n", argv[2]); 
            exit(2); 
        } 
        if (fscanf(lp_0, "%lg %d", &lp_0_h, &h) == EOF) { 
            fprintf(stderr, "initlamb: error in file %s\n", argv[3]); 
            exit(2); 
        } 
        if (fscanf(lp_25, "%lg %d", &lp_25_h, &h) == EOF) { 
            fprintf(stderr, "initlamb: error in file %s\n", argv[4]); 
            exit(2); 
        } 
        if (fscanf(lp_5, "%lg %d", &lp_5_h, &h) == EOF) { 
            fprintf(stderr, "initlamb: error in file %s\n", argv[5]); 
            exit(2); 
        } 
        if (fscanf(lp_75, "%lg %d", &lp_75_h, &h) == EOF) { 
            fprintf(stderr, "initlamb: error in file %s\n", argv[6]); 
            exit(2); 
        } 
        if (fscanf(lp_1, "%lg %d", &lp_1_h, &h) == EOF) { 
            fprintf(stderr, "initlamb: error in file %s\n", argv[7]); 
            exit(2); 
        } 

        fprintf(stderr, 
                "l_0=%.15g l_1=%.15g lp_0=%.15g lp_25=%.15g lp_5=%.15g lp_75=%.15g lp_1=%.15g\n", 
                l_0_h, l_1_h, lp_0_h, lp_25_h, lp_5_h, lp_75_h, lp_1_h); 

        /* If we happen to have hit the zero already, output it as 
           final. */ 
        if (lp_25_h == 0) printf("0.25 y\n"); 
        else if (lp_5_h == 0) printf("0.5 y\n"); 
        else if (lp_75_h == 0) printf("0.75 y\n"); 

        /* If the signs of l'(0) and l'(1) differ there is a zero 
           of l' in [0,1]. Find the subinterval in which it occurs. 
        */ 
        else if (lp_0_h < 0 && lp_1_h > 0) { 
            if (lp_0_h < 0 && lp_25_h > 0) printf("0.125 n\n"); 
            else if (lp_25_h < 0 && lp_5_h > 0) printf("0.375 n\n"); 
            else if (lp_5_h < 0 && lp_75_h > 0) printf("0.625 n\n"); 
            else if (lp_75_h < 0 && lp_1_h > 0) printf("0.875 n\n"); 
        } 
        else if (lp_0_h > 0 && lp_1_h < 0) { 
            if (lp_0_h > 0 && lp_25_h < 0) printf("0.125 n\n"); 
            else if (lp_25_h > 0 && lp_5_h < 0) printf("0.375 n\n"); 
            else if (lp_5_h > 0 && lp_75_h < 0) printf("0.625 n\n"); 
            else if (lp_75_h > 0 && lp_1_h < 0) printf("0.875 n\n"); 
        } 

        /* Otherwise choose the endpoint that maximizes the log 
           likelihood. */ 
        else if (l_0_h > l_1_h) printf("0 y\n"); 
        else printf("1 y\n"); 
    } 

    fclose(l_0); 
    fclose(l_1); 
    fclose(lp_0); 
    fclose(lp_25); 
    fclose(lp_5); 
    fclose(lp_75); 
    fclose(lp_1); 

    return 0; 
} 



K.c 



/* K.c */
/*
   Evaluates K(alpha).

   Input: $corpus.alpha as a file, $corpus.tok.counts on stdin

   Output: K(alpha) on stdout.
*/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define MAXWORDLEN 200

int main(int argc, char** argv) {
    FILE* alpha_fp;
    char j[MAXWORDLEN];
    long int F_j;
    double alpha, sum=0, temp;

    if (argc != 2) {
        fprintf(stderr, "usage: %s alpha < tok.counts\n", argv[0]);
        exit(2);
    };

    /* get alpha */
    if (!(alpha_fp = fopen(argv[1], "r"))) {
        fprintf(stderr, "%s: error opening %s\n", argv[0], argv[1]);
        exit(2);
    };

    fscanf(alpha_fp, "%lg", &alpha);
    fclose(alpha_fp);

    /* evaluate K(alpha) */
    while (scanf("%ld %s", &F_j, j) != EOF) {
        temp = alpha + F_j;
        sum += log(temp/alpha) + 0.5 * F_j/(alpha * temp);
    };

    printf("%.15g\n", sum);
    return 0;
}



lambda.c

/* lambda.c */
/*
   Increases or decreases lambda's depending on $corpus.lp.

   Input: $corpus.lambdas.work $corpus.lp
   (lambdas.work has lambda, final?, max, min)

   Output: lambda, final?, max, min on stdout
*/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define MAXWORDLEN 200

int main(int argc, char** argv) {
    FILE* lambdas;
    FILE* lp;
    double lambda_h, lp_h, max_h, min_h, new_lambda;
    char final[2];
    int h;

    if (argc != 3) {
        fprintf(stderr,
            "usage: %s lambda.work lp\n",
            argv[0]);
        exit(2);
    };

    if ((lambdas = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "lambda: can't open %s\n", argv[1]);
        exit(2);
    };

    if ((lp = fopen(argv[2], "r")) == NULL) {
        fprintf(stderr, "lambda: can't open %s\n", argv[2]);
        exit(2);
    };

    /* For each lambda_h, if it's not final, calculate the new value
       and print it on stdout, otherwise pass the old value through. */

    while (fscanf(lambdas, "%lg %1s", &lambda_h, final) != EOF) {
        if (fscanf(lp, "%lg %d", &lp_h, &h) == EOF) {
            fprintf(stderr, "lambda: error on file %s\n", argv[2]);
            exit(2);
        };

        if (final[0] == 'n') {
            fscanf(lambdas, "%lg %lg", &max_h, &min_h);

            if (lp_h > 0) {
                min_h = lambda_h;
                new_lambda = (max_h + min_h) / 2;
                printf("%.15g n %.15g %.15g\n", new_lambda, max_h, min_h);
            }
            else if (lp_h < 0) {
                max_h = lambda_h;
                new_lambda = (max_h + min_h) / 2;
                printf("%.15g n %.15g %.15g\n", new_lambda, max_h, min_h);
            }
            else printf("%.15g y\n", lambda_h);
        }
        else /* final != 'n' */
            printf("%.15g %s\n", lambda_h, final);
    };

    fclose(lambdas);
    fclose(lp);

    return 0;
}


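The update rule in lambda.c is a single bisection step on the sign of the derivative l'. The sketch below isolates that step as a pure function. The struct `lambda_state` and the function `bisect_step` are hypothetical names introduced for illustration; the branching follows the listing, which moves the lower bound up when l' is positive and the upper bound down when l' is negative.

```c
#include <assert.h>

/* Hypothetical record mirroring one line of lambdas.work. */
struct lambda_state {
    double lambda;  /* current estimate of lambda_h */
    double max;     /* upper end of the bracketing interval */
    double min;     /* lower end of the bracketing interval */
    int final;      /* nonzero once l' is exactly zero at lambda */
};

/* One bisection update, driven by the sign of the derivative lp. */
struct lambda_state bisect_step(struct lambda_state s, double lp) {
    assert(s.min <= s.max);
    if (s.final) return s;              /* pass final values through */

    if (lp > 0)      s.min = s.lambda;  /* zero of l' lies above lambda */
    else if (lp < 0) s.max = s.lambda;  /* zero of l' lies below lambda */
    else { s.final = 1; return s; }     /* hit the zero exactly */

    s.lambda = (s.max + s.min) / 2;     /* midpoint of the new bracket */
    return s;
}
```

Starting from lambda = 0.5 on the bracket [0, 1], a positive derivative moves the estimate to 0.75 and a following negative derivative moves it to 0.625, halving the bracket each time.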

mackay.csh
#mackay.csh

#Estimates alpha and m_i's needed for calculating probabilities
#by MacKay's method.

#Command line parameters are corpus name, tolerance.
#Input files are $corpus.tok.counts, $corpus.bigr.counts,
#$corpus.alpha, and $corpus.alpha.max.min.
#Output files are $corpus.alpha, $corpus.m_i.
#count.csh and merge.csh must be run before this.

#$corpus.alpha and $corpus.alpha.max.min must have been externally initialized.

set corpus=$argv[1]
set tolerance=$argv[2]
@ iterid=0
@ iterations=100

echo Starting mackay.csh

#Calculate N_fi's.

awk '{printf "%s %08d\n", $3, $1}' <$corpus.bigr.counts | sort -r | uniq -c |\
    awk '$2!=i{sum=0;i=$2}{sum+=$1;print sum, $2, $3}' >$corpus.N_fi.i.f

#Calculate G_i, H_i and V_i

G_i <$corpus.N_fi.i.f >$corpus.G_i.H_i.V_i

#Repeatedly calculate K(alpha), {u_i} and new alpha values until
#convergence is reached.

echo >$corpus.diffrnc

startloop:

echo Iteration $iterid

#Save old alpha and u_i.

cp $corpus.alpha $corpus.alpha.$iterid
cp $corpus.alpha.max.min $corpus.alpha.max.min.$iterid

if ($iterid > 0) then
    cp $corpus.u_i $corpus.u_i.$iterid
    rm $corpus.neg.flag
endif

#Calculate K(alpha)

K $corpus.alpha <$corpus.tok.counts >$corpus.K
if ($status) then
    echo K failed ; suspend
endif

u_i $corpus.K $corpus.G_i.H_i.V_i $corpus.neg.flag >$corpus.u_i
if ($status) then
    echo u_i failed ; suspend
endif

sumc <$corpus.u_i >$corpus.sum.u_i

alpha $corpus.alpha $corpus.sum.u_i $corpus.alpha.max.min >$corpus.temp
if ($status) then
    echo alpha failed ; suspend
endif

mv $corpus.temp $corpus.alpha

if ($iterid > 0) then
    awk '{print $1}' <$corpus.u_i >$corpus.temp1
    awk '{print $1}' <$corpus.u_i.$iterid >$corpus.temp2

    diffrnc $corpus.temp1 $corpus.temp2 $tolerance >> $corpus.diffrnc

    switch ($status)
    case 0:
        breaksw
    case 1:
        if ($iterid > $iterations) then
            echo Infinite loop ; suspend
        endif
        @ iterid++ ; goto startloop
        breaksw
    case 2:
        echo diffrnc failed ; suspend
    endsw
else
    @ iterid++ ; goto startloop
endif

echo Convergence reached
#Cleanup

rm $corpus.G_i.H_i.V_i $corpus.sum.* $corpus.N_fi.f.i $corpus.K
rm $corpus.neg.flag $corpus.alpha.u_i.* $corpus.alpha.max.min
rm $corpus.temp.*
exit
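For orientation, the per-iteration update of the collapsed parameters can be written out from the u_i.c listing in this appendix. This is a transcription of what that code computes, not a derivation. The quantities G_i, H_i and V_i are produced by the G_i program, whose definitions are not restated here, and the relation between alpha and the u_i is taken from the notation appendix.

```latex
u_i \;=\; \frac{2\,V_i}{K(\alpha) - G_i + \sqrt{\bigl(K(\alpha) - G_i\bigr)^2 + 4\,H_i\,V_i}},
\qquad
\alpha \;=\; \sum_i u_i .
```

The script repeats this update until diffrnc reports that successive sets of u_i agree to within the given tolerance, or until the iteration limit is exceeded.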



merge.csh
#merge.csh

#Merges the bigram and token counts for the files of a corpus,
#and calculates the total number of tokens.
#Input parameters are corpus name, file name pattern.
#Input files are $file.tok.counts and $file.bigr.counts for each
#file indicated by the file name pattern.
#Output files are $corpus.tok.counts and $corpus.bigr.counts.

set corpus=$argv[1]
set files=$argv[2]

echo Starting merge.csh

#Creates corpus count files if they don't already exist.
touch $corpus.tok.counts
touch $corpus.bigr.counts

#Merge file counts into corpus counts
foreach file ($files)
    mergehist $corpus.tok.counts $file.tok.counts >$corpus.temp
    if ($status) then
        echo mergehist failed on $file.tok.counts ; suspend
    endif

    mv $corpus.temp $corpus.tok.counts

    mergehist $corpus.bigr.counts $file.bigr.counts >$corpus.temp
    if ($status) then
        echo mergehist failed on $file.bigr.counts ; suspend
    endif

    mv $corpus.temp $corpus.bigr.counts
end



mergehist.c
/* mergehist.c

   Merges two sorted histograms.

   Adapted from "Tutorial on Text Corpora", Liberman & Marcus, ACL '92.

   Handles arbitrary numbers, not just integers.
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#define MAXLINE 2048

int main(int ac, char** av) {
    char *findval(char*);
    int getline(FILE*, char*, int);
    int putline(char*);
    FILE *infd1, *infd2;
    char in1[MAXLINE], in2[MAXLINE];
    int explain=0;
    register char *p1, *p2;
    register int ret1, ret2;
    register double n1, n2, temp;
    register int needed1, needed2;
    register int c;
    int compare;

    if (ac != 3) explain++;
    else {
        infd1 = fopen(av[1], "r");
        if (infd1 == NULL) {
            fprintf(stderr, "can't open %s\n", av[1]);
            explain++;
        };

        infd2 = fopen(av[2], "r");
        if (infd2 == NULL) {
            fprintf(stderr, "can't open %s\n", av[2]);
            explain++;
        };
    };

    if (explain) {
        fprintf(stderr, "usage: %s hist1 hist2\n", av[0]);
        fprintf(stderr, "\tinput files must be sorted histograms\n");
        exit(2);
    };

    needed1 = needed2 = 1;
    while (1) {
        if (needed1) {
            ret1 = getline(infd1, in1, MAXLINE);
            p1 = findval(in1);
            n1 = atof(in1);
        };

        if (needed2) {
            ret2 = getline(infd2, in2, MAXLINE);
            p2 = findval(in2);
            n2 = atof(in2);
        };

        if (ret1 == 0) {
            if (ret2 == 0) exit(0);
            else {
                /* first file exhausted: copy the rest of the second */
                putline(in2);
                while ((c = getc(infd2)) != EOF) putchar(c);
                exit(0);
            };
        }
        else if (ret2 == 0) {
            /* second file exhausted: copy the rest of the first */
            putline(in1);
            while ((c = getc(infd1)) != EOF) putchar(c);
            exit(0);
        };

        compare = strcmp(p1, p2);
        if (compare == 0) {
            /* same key in both files: add the two values */
            temp = n1 + n2;
            printf("%.15g %s\n", temp, p1);
            needed1 = needed2 = 1;
            continue;
        }
        else if (compare < 0) { /* must catch up in first file */
            putline(in1);
            needed1 = 1;
            needed2 = 0;
            continue;
        }
        else if (compare > 0) { /* must catch up in second file */
            putline(in2);
            needed2 = 1;
            needed1 = 0;
            continue;
        };
    };
}

/* Reads one non-empty line into buf; returns its length, or 0 at EOF. */
int getline(FILE *fd, char *buf, int max) {
    register int c, count=0;
    register char *s=buf;

    while (1) {
        c = getc(fd);
        switch (c) {
        case EOF:
            if (count == 0) return(0);
            else {*s = '\0'; return count;};
            break;
        case '\n':
            if (count == 0) continue;
            else {*s = '\0'; return count;};
            break;
        default:
            if (++count > max) {
                *s = '\0';
                fprintf(stderr, "overflow: |%s|\n", buf);
                exit(2);
            }
            else *s++ = c;
            break;
        };
    };
}

/* Skips the first field of a line and returns a pointer to the second. */
char *findval(register char *p) {
    while (*p && isspace(*p)) p++;
    while (*p && !isspace(*p)) p++;
    while (*p && isspace(*p)) p++;
    return(p);
}

int putline(register char *p) {
    while (*p) {
        putchar(*p);
        p++;
    };
    putchar('\n');
    return 0;
}



mprobs.c

/* mprobs.c */
/*
   Given alpha, and a file containing the u_i's and unigram and
   bigram frequencies for bigrams occurring in the test data, it
   calculates the conditional probabilities for those bigrams, using
   the formula

              F_ji + u_i
     q     = -------------
      i|j    F_j + alpha

   where F_ji is the count of bigram ji and F_j is the count
   of token j.

   Input files are corpus.alpha and corpus.j.i.F_ji.F_j.u_i.
   Output is the probability of each bigram in the format prob,j,i
   on stdout.
*/

#include <stdio.h>
#include <stdlib.h>
#define MAXWORDLEN 200

int main(int argc, char **argv) {
    FILE *alphafile;
    FILE *j_i_Fji_Fj_ui;
    long int F_ji, F_j;
    double alpha, u_i;
    char j[MAXWORDLEN], i[MAXWORDLEN];
    int result;

    /* Check parameters. */
    if (argc != 3) {
        fprintf(stderr, "Usage: mprobs corpus.alpha corpus.j.i.F_ji.F_j.u_i\n");
        return 2;
    };

    /* Open files. */
    alphafile = fopen(argv[1], "r");
    if (!alphafile) {
        fprintf(stderr, "mprobs: cannot open %s\n", argv[1]);
        exit(2);
    };

    result = fscanf(alphafile, "%lg", &alpha);
    if (!result || result == EOF) {
        fprintf(stderr, "%s not an alpha file", argv[1]);
        exit(2);
    };

    fclose(alphafile);

    j_i_Fji_Fj_ui = fopen(argv[2], "r");
    if (!j_i_Fji_Fj_ui) {
        fprintf(stderr, "mprobs: cannot open %s\n", argv[2]);
        exit(2);
    };

    /* Calculate and print probabilities. */
    while (fscanf(j_i_Fji_Fj_ui, "%s %s %ld %ld %lg", j, i, &F_ji, &F_j, &u_i)
           != EOF)
        printf("%.15g %s %s\n", (F_ji + u_i)/(F_j + alpha), j, i);

    fclose(j_i_Fji_Fj_ui);
    return 0;
}

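The formula in the mprobs.c header can be rearranged to show how it relates to an interpolation of observed and prior frequencies. The rearrangement below is plain algebra on that formula; it uses the definitions u_i = alpha m_i and f_{i|j} = F_{i|j}/F_j from the notation appendix and adds no assumptions beyond F_j > 0.

```latex
q_{i|j}
  \;=\; \frac{F_{i|j} + u_i}{F_j + \alpha}
  \;=\; \frac{F_{i|j} + \alpha\, m_i}{F_j + \alpha}
  \;=\; \frac{F_j}{F_j + \alpha}\, f_{i|j} \;+\; \frac{\alpha}{F_j + \alpha}\, m_i .
```

For a bigram that never occurred in the training corpus, F_{i|j} = 0 and the estimate reduces to u_i / (F_j + alpha), so the weight given to the prior shrinks as the context word j becomes more frequent.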


perplex2.c

/* perplex2.c */

/*
   Input is a file text.N giving the number of bigrams in the test text,
   and a table of probabilities and counts for the bigrams in the
   test text.

   Output is the probability and perplexity of the text on stdout.
*/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define MAXWORDLEN 200

int main(int argc, char **argv) {
    FILE *Nfile, *probs;
    double product=1.0, N, prob;
    char i[MAXWORDLEN], j[MAXWORDLEN];
    int result;
    long int count;

    if (argc != 3) {
        fprintf(stderr, "Usage: perplex2 corpus.N test.something.probs\n");
        return 1;
    };

    Nfile = fopen(argv[1], "r");
    result = fscanf(Nfile, "%lg", &N);
    if (!result || result == EOF) {
        fprintf(stderr, "%s not a N file", argv[1]);
        exit(2);
    };

    fclose(Nfile);

    probs = fopen(argv[2], "r");
    if (!probs) {
        fprintf(stderr, "perplex2: can't open %s\n", argv[2]);
        exit(2);
    };

    /* Each bigram contributes its probability once per occurrence. */
    while ((fscanf(probs, "%ld %lg %s %s", &count, &prob, j, i)) != EOF)
        product *= pow(prob, count);

    fclose(probs);

    printf("Probability=%.15g\tPerplexity=%.15g\n",
           product, product ? pow(product, (-1.0/N)) : HUGE_VAL);
    return 0;
}



perplext.csh
#perplext.csh

#Calculates the probability and perplexity of a test file using
#the two different smoothing methods.
#Input parameters are: corpus name, test file name.
#Input files are: $corpus.bigr.counts, $corpus.tok.counts,
#$test.bigr.counts, $test.N, $corpus.alpha, $corpus.u_i, $corpus.lambdas,
#$corpus.tok.freq, $corpus.bigr.freq.
#Output files are: $corpus.delint.answer, $corpus.mackay.answer.

set corpus=$argv[1]
set test=$argv[2]

echo Starting perplext.csh

#Lambda-mixture method.

echo Starting lambda-mixture method

combine $corpus.bigr.freq $test.bigr.counts |\
    awk '$2!=0{print $1,$3,$4}' >$corpus.f_ji.j.i
if ($status) then
    echo first combine failed ; suspend
endif

sort +1 $corpus.tok.freq $corpus.f_ji.j.i | \
    awk 'NF==2{fj=$1}NF==3{print $2,$3,$1,fj}' \
    >$corpus.j.i.f_ji.f_j
if ($status) then
    echo first sort failed ; suspend
endif

sort +1 $corpus.j.i.f_ji.f_j $corpus.tok.freq |\
    awk 'NF==2{fi=$1}NF==4{print $1,$2,$3,$4,fi}' \
    | sort >$corpus.j.i.f_ji.f_j.f_i
if ($status) then
    echo second sort failed ; suspend
endif

dprobs $corpus.lambdas $corpus.j.i.f_ji.f_j.f_i >$test.delint.probs
if ($status) then
    echo dprobs failed ; suspend
endif

combine $test.bigr.counts $test.delint.probs >$corpus.temp
if ($status) then
    echo second combine failed ; suspend
endif

perplex2 $test.N $corpus.temp >$test.delint.answer
if ($status) then
    echo perplex2 failed ; suspend
endif

rm $corpus.j.i.f_ji.f_j.f_i $corpus.j.i.f_ji.f_j

#MacKay's method.

echo Starting Mackays method

combine $corpus.bigr.counts $test.bigr.counts | \
    awk '$2!=0{print $1,$3,$4}' >$corpus.temp
if ($status) then
    echo first combine failed ; suspend
endif

sort -m +1 $corpus.temp $corpus.tok.counts | \
    awk 'NF==2{Fj=$1}NF==3{print $2,$3,$1,Fj}' >$corpus.j.i.F_ji
if ($status) then
    echo first sort failed ; suspend
endif

sort +1 $corpus.j.i.F_ji.F_j $corpus.u_i |\
    awk 'NF==2{mi=$1}NF==4{print $1,$2,$3,$4,mi}' |\
    sort >$corpus.j.i.F_ji.F_j.u_i
if ($status) then
    echo second sort failed ; suspend
endif

mprobs $corpus.alpha $corpus.j.i.F_ji.F_j.u_i >$test.mackay.prob
if ($status) then
    echo mprobs failed ; suspend
endif

combine $test.bigr.counts $test.mackay.probs >$corpus.temp
if ($status) then
    echo second combine failed ; suspend
endif

perplex2 $test.N $corpus.temp >$test.mackay.answer
if ($status) then
    echo perplex2 failed ; suspend
endif

rm $corpus.j.i.F_ji.F_j.u_i $corpus.j.i.F_ji.F_j
rm $corpus.temp $corpus.f_ji.j.i



sed.script

s/' [0-9] [0-9]*\. */<s> / 
s/\n / / 
s/$/ <\/s>/ 
s/\\\${>/$ /g 
s/\\["]\([a-zA-Z]\)/\l/g 
s/\\\ ([&•/._] \){>/\l/g 
s/\([!?:;]\)/ \1 /g 
s/-/ - /g 
s/\.\.\./ . . .1 /g 
s/Mr\./Mr. 1/g 
s/Mrs\./Mrs.l/g 
s/Dr\./Dr.l/g 
s/Ms\./Ms.l/g 

s/Io\. *\([0-9]\)/Io.l \l/g 
s/ incl\ . / incl . 1/g 
s/\([ap]\)\.m\./\l.m.l/g 
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s/\([Hh]\)on\./\lon.l/g 
s/\([.,]\) / \1 /g 
s/(\(.\)\([-)]\)/( \l\2/g 
s/\([-(]\)\(.\))/\l\2 )/g 
s/\.\.\.l/. . ./g 
s/Mr. l/Mr./g 
s/Mrs.l/Mrs./g 
s/Dr.l/Dr./g 
s/Ms.l/Ms./g 
s/Io.l /Ho. /g 
s/\([Hh]\)on.l/\lon./g 
s/ incl\ . 1/ incl . /g 
s/\([ap]\)\.m\.l/\l.m./g 
s/"/ " /g 
s/V Vg 
s/ / /g 



spray.awk

#Distributes lines of input among the named files
BEGIN{n=1}
n==1{print $0 > "big.1"}
n==2{print $0 > "big.2"}
n==3{print $0 > "big.3"}
n==4{print $0 > "big.4"}
n==5{print $0 > "big.5"}
n==6{print $0 > "big.6"}
n==7{print $0 > "big.7"}
n==8{print $0 > "big.8"}
n==9{print $0 > "big.9"}
{n++; if (n>9) n=1}



sumc.c

/* sumc.c */
/*
   Accepts a sequence of lines on stdin, sums the first field of
   each line, and outputs the results on stdout.

   Adapted from "Tutorial on Text Corpora", by Liberman and Marcus.
*/

#include <stdio.h>
#include <stdlib.h>
#define MAXLINE 2048

int getline(FILE *fd, char *buf, int max);

int main() {
    register double sum=0;
    char inbuf[MAXLINE];

    while (getline(stdin, inbuf, MAXLINE)) sum += atof(inbuf);
    printf("%.15g\n", sum);

    return 0;
}

/* Reads one non-empty line into buf; returns its length, or 0 at EOF. */
int getline(FILE *fd, char *buf, int max) {
    register int c, count=0;
    register char *s = buf;

    while (1) {
        c = getc(fd);
        switch (c) {
        case EOF:
            if (count == 0) return (0);
            else {*s = '\0'; return count;};
            break;
        case '\n':
            if (count == 0) continue;
            else {*s = '\0'; return count;};
            break;
        default:
            if (++count > max) {
                *s = '\0';
                fprintf(stderr, "overflow: |%s|\n", buf);
                exit(2);
            }
            else *s++ = c;
            break;
        }
    }
}



terms1.c

/* terms1.c */
/*
   This program calculates the terms contributed by one text block
   to each of l (log likelihood) and l' wrt lambda_h, with several
   different values of each lambda_h. These terms are used to
   determine the starting points for searching for the lambda_h's
   in delint.csh.

   Input file: corpus-file.j.i.F_ji.f_ji.f_j.f_i

   Output files: corpus-file.terms.l_0, corpus-file.terms.l_1,
       corpus-file.terms.lp_0, corpus-file.terms.lp_.25,
       corpus-file.terms.lp_.5, corpus-file.terms.lp_.75,
       corpus-file.terms.lp_1

   Usage: terms1 #lambdas corpus-file.j.i.F_ji.f_ji.f_j.f_i corpus-file.terms.l_0
       corpus-file.terms.l_1 corpus-file.terms.lp_0 corpus-file.terms.lp_.25
       corpus-file.terms.lp_.5 corpus-file.terms.lp_.75 corpus-file.terms.lp_1

   Note: +/- FLT_MAX is used to represent +/- Infinity. This
   should be a good enough approximation for our purposes, and it
   allows terms of Infinity to be summed using double arithmetic.
*/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>
#define MAXWORDLEN 80
#define MAXLAMBDAS 200

int main(int argc, char* argv[]) {
    FILE *Fji_fji_fj_fi;
    FILE *l_0, *l_1, *lp_0, *lp_25, *lp_5, *lp_75, *lp_1;
    int r, h;
    long int F_ji;
    double f_ji, f_j, f_i, u_ji;
    double suml_0[MAXLAMBDAS+1], suml_1[MAXLAMBDAS+1];
    double sumlp_0[MAXLAMBDAS+1], sumlp_25[MAXLAMBDAS+1];
    double sumlp_5[MAXLAMBDAS+1], sumlp_75[MAXLAMBDAS+1];
    double sumlp_1[MAXLAMBDAS+1], interval;
    char i[MAXWORDLEN], j[MAXWORDLEN];

    /* check parameters */
    if (argc != 10) {
        fprintf(stderr, "terms1: wrong number of parameters\n");
        exit(2);
    };

    /* Calculate size of intervals for lambdas 1 thru r-1.
       Interval r takes up the rest.
       For simplicity, h runs from 1 to r rather than to r-1. */

    r = atoi(argv[1]);
    if (r > MAXLAMBDAS) {
        fprintf(stderr, "terms1: too many lambdas\n");
        exit(2);
    };

    interval = 0.03/r;

    /* initialize temporary sums */
    h = 1;
    while (h <= r) {
        suml_0[h] = suml_1[h] = sumlp_0[h] = sumlp_25[h] = 0;
        sumlp_5[h] = sumlp_75[h] = sumlp_1[h] = 0;
        h++;
    };

    /* For each bigram in this block (that is, for each j,i in fji_fj_fi),
       calculate the terms that contribute to each sum using the
       specified values of lambda_h */

    if ((Fji_fji_fj_fi = fopen(argv[2], "r")) == NULL) {
        fprintf(stderr, "terms1: can't open %s\n", argv[2]);
        exit(2);
    };

    while (fscanf(Fji_fji_fj_fi, "%s %s %ld %lg %lg %lg",
                  j, i, &F_ji, &f_ji, &f_j, &f_i) != EOF) {
        h = floor(f_j/interval + 1);
        if (h > r) h = r;
        u_ji = f_i - f_ji;   /* as in terms2.c */

        suml_0[h] += f_ji ? F_ji*log(f_ji) : -FLT_MAX;
        suml_1[h] += F_ji*log(u_ji + f_ji);
        sumlp_0[h] += f_ji ? F_ji*u_ji/f_ji : FLT_MAX;
        sumlp_25[h] += F_ji*u_ji/(0.25*u_ji + f_ji);
        sumlp_5[h] += F_ji*u_ji/(0.5*u_ji + f_ji);
        sumlp_75[h] += F_ji*u_ji/(0.75*u_ji + f_ji);
        sumlp_1[h] += F_ji*u_ji/(u_ji + f_ji);
    };

    fclose(Fji_fji_fj_fi);

    /* Write out the new terms */

    if ((l_0 = fopen(argv[3], "w")) == NULL) {
        fprintf(stderr, "terms1: can't open %s\n", argv[3]);
        exit(2);
    };

    if ((l_1 = fopen(argv[4], "w")) == NULL) {
        fprintf(stderr, "terms1: can't open %s\n", argv[4]);
        exit(2);
    };

    if ((lp_0 = fopen(argv[5], "w")) == NULL) {
        fprintf(stderr, "terms1: can't open %s\n", argv[5]);
        exit(2);
    };

    if ((lp_25 = fopen(argv[6], "w")) == NULL) {
        fprintf(stderr, "terms1: can't open %s\n", argv[6]);
        exit(2);
    };

    if ((lp_5 = fopen(argv[7], "w")) == NULL) {
        fprintf(stderr, "terms1: can't open %s\n", argv[7]);
        exit(2);
    };

    if ((lp_75 = fopen(argv[8], "w")) == NULL) {
        fprintf(stderr, "terms1: can't open %s\n", argv[8]);
        exit(2);
    };

    if ((lp_1 = fopen(argv[9], "w")) == NULL) {
        fprintf(stderr, "terms1: can't open %s\n", argv[9]);
        exit(2);
    };

    for (h = 1; h <= r; h++) {
        fprintf(l_0, "%.15g %d\n", suml_0[h], h);
        fprintf(l_1, "%.15g %d\n", suml_1[h], h);
        fprintf(lp_0, "%.15g %d\n", sumlp_0[h], h);
        fprintf(lp_25, "%.15g %d\n", sumlp_25[h], h);
        fprintf(lp_5, "%.15g %d\n", sumlp_5[h], h);
        fprintf(lp_75, "%.15g %d\n", sumlp_75[h], h);
        fprintf(lp_1, "%.15g %d\n", sumlp_1[h], h);
    };

    fclose(l_0);
    fclose(l_1);
    fclose(lp_0);
    fclose(lp_25);
    fclose(lp_5);
    fclose(lp_75);
    fclose(lp_1);

    return 0;
}



terms2.c

/* terms2.c */
/*
   This program calculates the terms contributed by one text block
   to each of l' wrt lambda_h and l'' wrt lambda_h, for each
   lambda_h. These terms are used as part of the search for the
   lambda_h's, which is carried on in delint.csh.

   Input files: corpus.lambdas, corpus-file.j.i.F_ji.f_ji.f_j.f
   Output files: corpus-file.terms.lp, corpus-file.terms.lpp
   Usage: terms2 corpus.lambdas corpus-file.j.i.F_ji.f_ji.f_j.f_
       corpus-file.lp corpus-file.lpp
*/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define MAXWORDLEN 80
#define MAXLAMBDAS 200

int main(int argc, char* argv[]) {
    FILE *lambdas, *Fji_fji_fj_fi;
    FILE *lp, *lpp;
    int r, h;
    long int F_ji;
    double f_ji, f_j, f_i, u_ji, lambda[MAXLAMBDAS+1], temp_double;
    double sump[MAXLAMBDAS+1], sumpp[MAXLAMBDAS+1], interval;
    char i[MAXWORDLEN], j[MAXWORDLEN], final[MAXLAMBDAS+1], temp_char;

    /* check parameters */
    if (argc != 5) {
        fprintf(stderr, "Usage: terms2 lambdas j.i.F_ji.f_ji.f_j.f_i lp lpp\n");
        exit(2);
    };

    /* Load current values of all the lambdas from lambda file,
       and calculate interval.
       For simplicity, h runs from 1 to r rather than to r-1. */
    if ((lambdas = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "terms2: can't open %s\n", argv[1]);
        exit(2);
    };

    r = 1;
    while (r <= MAXLAMBDAS &&
           fscanf(lambdas, "%lg %1s", &lambda[r], &final[r]) != EOF)
        r++;

    if (r > MAXLAMBDAS &&
        fscanf(lambdas, "%lg %1s", &temp_double, &temp_char) != EOF) {
        fprintf(stderr,
            "terms2: Too many lambdas. Change MAXLAMBDAS and recompile.\n");
        exit(2);
    };

    r--; /* r now represents the number of lambdas found in the file */
    interval = 0.03/r; /* size of intervals for lambdas 1 thru r-1 */

    /* initialize temporary sums */
    h = 1;
    while (h <= r) {
        sump[h] = sumpp[h] = 0;
        h++;
    };

    /* For each bigram in this block (that is, for each j,i in fji_fj_fi),
       calculate the terms that contribute to l' and l'' using the old lambdas */

    if ((Fji_fji_fj_fi = fopen(argv[2], "r")) == NULL) {
        fprintf(stderr, "terms2: can't open %s\n", argv[2]);
        exit(2);
    };

    while (fscanf(Fji_fji_fj_fi, "%s %s %ld %lg %lg %lg",
                  j, i, &F_ji, &f_ji, &f_j, &f_i) != EOF) {
        h = floor(f_j/interval + 1);
        if (h > r) h = r;
        u_ji = f_i - f_ji;
        sump[h] += F_ji*u_ji/(lambda[h]*u_ji + f_ji);
        sumpp[h] -= F_ji*u_ji*u_ji/pow(lambda[h]*u_ji + f_ji, 2);
    };

    fclose(Fji_fji_fj_fi);

    /* Write out the new terms */

    if ((lp = fopen(argv[3], "w")) == NULL) {
        fprintf(stderr, "terms2: can't open %s\n", argv[3]);
        exit(2);
    };

    if ((lpp = fopen(argv[4], "w")) == NULL) {
        fprintf(stderr, "terms2: can't open %s\n", argv[4]);
        exit(2);
    };

    for (h = 1; h <= r; h++) {
        fprintf(lp, "%.15g %d\n", sump[h], h);
        fprintf(lpp, "%.15g %d\n", sumpp[h], h);
    };

    fclose(lp);
    fclose(lpp);

    return 0;
}

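The terms accumulated by terms1.c and terms2.c follow from differentiating the log likelihood of the deleted-estimation mixture with respect to one lambda_h. The block below writes this out as a sketch. It assumes, as the code suggests, that the sum runs over the bigrams ji whose f_j falls in interval h, and it uses the abbreviation u_ji = f_i - f_{i|j} from the listings.

```latex
\begin{aligned}
\lambda_h f_i + (1-\lambda_h)\, f_{i|j} &= \lambda_h u_{ji} + f_{i|j},
  \qquad u_{ji} = f_i - f_{i|j}, \\
l(\lambda_h)   &= \sum_{ji} F_{i|j}\, \log\bigl(\lambda_h u_{ji} + f_{i|j}\bigr), \\
l'(\lambda_h)  &= \sum_{ji} \frac{F_{i|j}\, u_{ji}}{\lambda_h u_{ji} + f_{i|j}}, \\
l''(\lambda_h) &= -\sum_{ji} \frac{F_{i|j}\, u_{ji}^{2}}{\bigl(\lambda_h u_{ji} + f_{i|j}\bigr)^{2}} .
\end{aligned}
```

Since l'' is never positive, l is concave in lambda_h. A zero of l' in [0,1] is therefore a maximum, and when no zero exists the maximum lies at one of the endpoints, which is the case distinction the initialization code makes.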


testprep.csh
#testprep.csh

#Prepares a test text for the perplexity calculation.
#Parameter is test text filename.
#Output files are $test.bigr.counts, $test.N.

set test=$argv[1]

echo Starting testprep.csh

bigrams <$test | sort | uniq -c >$test.bigr.counts
if ($status) then
    echo bigrams failed ; suspend
endif

sumc <$test.bigr.counts >$test.N
if ($status) then
    echo sumc failed ; suspend
endif



tokens.c

/* tokens.c */

/* Given a sentence of tokens separated by whitespace on stdin,
   writes one token per line on stdout.
*/

#include <stdio.h>
#include <ctype.h>

int main(int argc, char **argv) {
    register int c, state;
    state = 0;

    while ((c = getchar()) != EOF) {
        switch (state) {
        case 0: /* reading whitespace */
            if (!isspace(c)) {
                state = 1;
                putchar(c);
            }
            break;

        case 1: /* reading a token */
            if (isspace(c)) {
                state = 0;
                putchar('\n');
            }
            else putchar(c);
            break;
        };
    };

    if (state == 1) putchar('\n');
    return 0;
}


tokfreq.c

/* tokfreq.c */
/*
   Given the total number of tokens and the counts for each token,
   calculates the relative frequency of each token and prints it to
   stdout.
*/

#include <stdio.h>
#include <stdlib.h>
#define MAXWORDLEN 200

int main(int argc, char **argv) {
    FILE *toktotal;
    FILE *tokcounts;
    long int total, count;
    double totald;
    char rest[MAXWORDLEN];
    int result;

    if (argc != 3) {
        fprintf(stderr, "Usage: tokfreq corpus.tok.tot corpus.tok.counts\n");
        return 1;
    };

    toktotal = fopen(argv[1], "r");
    result = fscanf(toktotal, "%ld", &total);
    if (!result || result == EOF) {
        fprintf(stderr, "%s not a total file", argv[1]);
        exit(2);
    };

    fclose(toktotal);

    totald = (double) total;

    tokcounts = fopen(argv[2], "r");

    while ((result = fscanf(tokcounts, "%ld %s", &count, rest)) && result != EOF)
        printf("%.15g %s\n", count/totald, rest);
    fclose(tokcounts);
    return 0;
}



u_i.c

/* u_i.c */
/*
   Calculates the {u_i} from G_i, H_i, V_i and K.

   Input: $corpus.K $corpus.G_i.H_i.V_i

   Output: $corpus.neg.flag as a file, u_i i on stdout.
*/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define MAXWORDLEN 200

int main(int argc, char** argv) {
    FILE* K_fp;
    FILE* G_i_fp;
    FILE* neg_flag_fp;
    char i[MAXWORDLEN];
    long int V_i;
    double K, G_i, H_i, u_i;
    int neg_flag=0;

    if (argc != 4) {
        fprintf(stderr, "usage: %s K G_i.V_i neg.flag\n", argv[0]);
        exit(2);
    };

    /* get K */
    if (!(K_fp = fopen(argv[1], "r"))) {
        fprintf(stderr, "%s: error opening %s\n", argv[0], argv[1]);
        exit(2);
    };

    fscanf(K_fp, "%lg", &K);
    fclose(K_fp);

    /* open G_i file */
    if (!(G_i_fp = fopen(argv[2], "r"))) {
        fprintf(stderr, "%s: error opening %s\n", argv[0], argv[2]);
        exit(2);
    };

    /* Calculate u_i's */
    while (fscanf(G_i_fp, "%lg %lg %ld %s", &G_i, &H_i, &V_i, i) != EOF) {
        u_i = (2*V_i)/(K - G_i + sqrt(pow(K-G_i, 2) + 4*H_i*V_i));
        if (u_i < 0) neg_flag++;
        printf("%.15g %s\n", u_i, i);
    };

    fclose(G_i_fp);

    if (!(neg_flag_fp = fopen(argv[3], "w"))) {
        fprintf(stderr, "%s: error opening %s\n", argv[0], argv[3]);
        exit(2);
    };

    fprintf(neg_flag_fp, "%d", neg_flag);
    fclose(neg_flag_fp);
    return 0;
}


B Data Samples



Aligned Hansard before pre-processing 

10. I therefore move, seconded by the hon. member for Cape Breton-The 
Sydney (Mr. Muir) : 

11. That the matter of the affront of the Cape Breton Development 
Corporation and its president to the Standing Committee on Regional 
Development, and to two ministers of the Crown, be referred to the 
Standing Committee on Privileges and Elections. 

12. May I say, Mr. Speaker, that I have sent a copy of this to the 
chairman of the committee and to the two ministers involved. 

13. The hon. member has provided the Chair with the required notice of 
his intention to raise this matter today by way of a question of 
privilege.

14. The hon. member appreciates that what he is proposing now is that 
this motion be given priority over all other business now before the 
House of Commons and that a debate ensue on the suggested question of 
privilege.

15. I would hesitate very much to suggest to the hon. member that the 
grievance he has should be debated by the House at this time by way of 
a question of privilege. 

16. Perhaps he would allow the Chair to look at the matter within the 
next few hours and to study the notes the hon. member has submitted to 
the Chair for consideration before reaching a decision whether there is 
a prima facie case of privilege that would justify the putting of the 
motion proposed by the hon. member and a debate based on the hon. 
member's motion. 

17. Again I ask that he give the Chair an opportunity to look into the 
matter on his behalf and on behalf of all other members of the House. 

18. 1. 

19. For the period April 1, 1973 to January 31, 1974, what amount of 
money was expended on the operation and maintenance of the Prime 
Minister's residence at (a) 24 Sussex Drive, Ottawa (b) Harrington 
Lake, Quebec? 



Pre-processed blocks

<s> The House met at # p.m. </s>

<s> I therefore move , seconded by the hon. member for Cape Breton-The Sydney ( Mr. Muir ) : </s>

<s> For the period April # , # to January # , # , what amount of money was expended on the operation and maintenance of the Prime Minister 's residence at (a) # Sussex Drive , Ottawa (b) Harrington Lake , Quebec ? </s>

<s> Perhaps I did not hear him correctly , but I understood that starred question No. # was to be answered . </s>

<s> Mr. Speaker , Mr. McInnis had performed legal services in # in respect of a particular property in the province of Nova Scotia . </s>
<s> # . </s>

<s> With reference to the answer to Question No. # of the First Session of the #th Parliament what is the exact purpose of the construction work listed as being carried out in the year #-# at the Prime Minister 's official Ottawa residence and for what reasons is such construction deemed necessary ? </s>
<s> # . </s>

<s> The question , therefore , is not one to which the government can respond . </s>

<s> Under the new proposals presently being considered by the Department of Transport , which provide for a $ # annual landing fee , plus a $ # minimum landing fee at the international airports of Montreal , Toronto , Winnipeg , Edmonton , Calgary and Vancouver , how much revenue is anticipated on an annual basis ? </s>



Bigram conditional relative frequencies

These are the raw conditional relative frequencies as calculated from the training data and stored
in $corpus.bigr.freq, so only those bigrams that occurred in the training data are represented. The
record layout is f_{i|j} j i.

0.000333667000333667 's inquiry 
0.000333667000333667 's insistence 
0.000667334000667334 's institution 
0.000333667000333667 's institutions 
0.000333667000333667 's instructions 
0.00767434100767434 's intention 
0.00233566900233567 's intentions 
0.00233566900233567 's interest 
0.001001001001001 's interests 
0.000667334000667334 's interference 
0.000333667000333667 's interjection 



Bigram probabilities after smoothing with MacKay's method 

Only those bigram probabilities needed for the test sample were actually calculated. 

4.39380132207744e-07 's incompetence 
0.000985968776387494 's industrial 
9.89359541597536e-07 's initial 
0.000327047703278967 's inquest 
0.00032997981831265 's inquiry 
0.00752128529734649 's intention 
0.00228955607194425 's intentions 
0.00230075999501251 's interest 
0.0016408538264022 's international 
0.000327596990651111 's invitation 
0.000332481429544508 's job 



Bigram probabilities after smoothing with deleted estimation

2.14743675590986e-06 's incompetence
0.000789999620882728 's industrial
8.85817661812819e-06 's initial
0.000249866988970725 's inquest
0.000278857385175508 's inquiry
0.00577378370577555 's intention
0.00175349801110414 's intentions
0.00188663908997055 's interest
0.0013034235081431 's international
0.000252685499712857 's invitation
0.000291607790913723 's job



C Notation



$\alpha$                  control parameter in MacKay's method: $\alpha = \sum_i u_i$

$D = s_1, \ldots, s_n$    the training corpus

$F_j$                     number of occurrences of word j in training corpus

$F_{i|j}$                 number of occurrences of bigram ji in training corpus

$f_i$                     relative frequency of i in training corpus

$f_{i|j} = F_{i|j}/F_j$   conditional relative frequency of bigram ji in training corpus

$f_i^{-k}$                relative frequency of i with block k of data omitted

$f_{i|j}^{-k}$            conditional relative frequency of ji with block k of data omitted

$\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\,du$    Gamma function: in general $\Gamma(x+1) = x\,\Gamma(x)$

i and j                   generic words (types)

ji                        a generic bigram

$\lambda, \lambda_h$      parameters controlling the mix in deleted estimation

$m = \{m_i\}$             null measure for Dirichlet prior in MacKay's method

$N$                       total number of bigrams in a corpus

$\Psi(x) = \frac{d}{dx}\log\Gamma(x)$    digamma function: $\Psi(x+1) = \Psi(x) + \frac{1}{x}$

$Q = [q_{i|j}]$           predictive conditional probability matrix: $q_{i|j} = \Pr(i \mid j)$

$u = \{u_i\}$             collapsed parameters in MacKay's method: $u_i = \alpha m_i$

$w_{x,y}$                 yth word of xth sentence in the training corpus

$W$                       number of distinct words in the training corpus vocabulary